Systems and Methods for Identifying and Expressing Gene Clusters

ABSTRACT

Methods for identifying biosynthetic gene clusters that include genes for producing compounds that interact with specific target proteins are disclosed. Some methods relate to bioinformatics methods for identifying and/or prioritizing biosynthetic gene clusters. Related systems, components, and tools for the identification and expression of such gene clusters are also disclosed.

CROSS-REFERENCE

This application claims the benefit of U.S. Provisional Application No. 62/423,196, filed Nov. 16, 2016, U.S. Provisional Application No. 62/481,601, filed Apr. 4, 2017, and U.S. Ser. No. 15/469,452, filed Mar. 24, 2017, which applications are incorporated herein by reference.

STATEMENT AS TO FEDERALLY SPONSORED RESEARCH

This invention was made with Government support under U01 GM110706 awarded by the National Institutes of Health. The government has certain rights to the invention.

REFERENCE TO A SEQUENCE LISTING SUBMITTED ELECTRONICALLY VIA EFS-WEB

The instant application contains a Sequence Listing which has been filed electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Nov. 9, 2017, is named 52592-702_601_SL.txt and is 1,208,543 bytes in size.

TECHNICAL FIELD

The present disclosure generally relates to the introduction of gene clusters into host organisms for the manufacture of small molecules. In some cases, the small molecules are analogs of products produced by the organism in which the gene cluster is identified and that modulate a specific protein. More specifically, the disclosure also relates to the identification of gene clusters likely to produce products that target a specific target protein and methods of expressing gene clusters in host cells. Additionally, sequences of various gene clusters are provided, together with structures of compounds produced from these gene clusters.

BACKGROUND

“Secondary metabolites,” as used herein, are small molecules that can be produced by the expression of one or more gene clusters. Often, secondary metabolites are not critical for the survival of the organism in which the gene cluster is natively found. Secondary metabolites can be clinically valuable small molecules. Examples include: antibacterial products such as penicillin and daptomycin; antifungal products such as amphotericin; cholesterol-lowering products such as lovastatin; anticancer products such as taxol and eribulin; and immune-modulating products including rapamycin and cyclosporine. In spite of the great success of secondary metabolites in the history of drug discovery, challenges of secondary metabolites in drug discovery and development can include (i) extremely low yields, (ii) limited supply, (iii) complex structures posing difficulty for structural modifications, and (iv) complex structures precluding practical synthesis. These difficulties have prompted the pharmaceutical industry to embrace new technologies in past decades, particularly combinatorial chemistry, as an alternative to natural product discovery. As a result, the percentage of new secondary metabolites being tested for use in medical treatment of humans has declined steadily since the 1940s due to a greater reliance on synthetic libraries that can be utilized in high throughput screening. Despite the pharmaceutical industry's preference for synthetic libraries, secondary metabolites possess enormous structural and chemical diversity that is unsurpassed by synthetic libraries. Most importantly, secondary metabolites are often evolutionarily optimized as drug-like molecules to target specific proteins and/or pathways.

Thousands of bacterial and fungal genomes have now been sequenced. These organisms are known to be rich sources of secondary metabolites. These secondary metabolites are enzymatically biosynthesized by the products of one or more genes, often grouped into gene clusters. New genome sequences have revealed that traditional approaches have tapped only a fraction of the biosynthetic potential of these organisms as, on average, fewer than 10% of the biosynthetic gene clusters (BGCs) in a microbial genome are expressed in any single culture condition. Further, millions of fungal species are believed to exist in nature, but have not been cultured in the laboratory. Accordingly, the supply bottleneck for secondary metabolites can be reduced by introducing the genes utilized in the synthesis of the secondary metabolite into a microorganism that can overproduce a desired secondary metabolite or analog thereof. In this way, the vast, untapped, ecological biodiversity of microbes holds renewed promise for the discovery of novel secondary metabolites useful in one or more contexts, such as the treatment of disease.

SUMMARY OF THE INVENTION

In some embodiments, the present disclosure provides a method for screening a plurality of compounds, the method comprising: identifying a gene cluster that includes or is within 20 kilobases of a region that encodes for a protein that is identical with or homologous to a first target protein, wherein the gene cluster comprises a region that encodes for a protein selected from the group consisting of (1) polyketide synthases, (2) non-ribosomal peptide synthetases, (3) terpene synthetases, (4) UbiA-type terpene cyclases, and (5) dimethylallyl transferases; introducing a plurality of genes from the gene cluster into a vector; introducing the vector into a host cell; expressing the proteins encoded by the plurality of genes in the host cell; and determining whether a compound that is formed or modified by the expressed proteins modulates the first target protein. In some cases, the host cell is a yeast cell. In some cases, the yeast cell is a yeast cell that has been modified to have increased sporulation frequency and increased mitochondrial stability. In some cases, each gene of the plurality of genes is under the control of a different promoter. In some cases, the promoters are designed to increase expression when the host cell is in the presence of a nonfermentable carbon source. In some cases, the plurality of genes are introduced into the vector via homologous recombination.
In some cases, introducing the plurality of genes into the vector via homologous recombination comprises combining a first plurality of nucleotides with a second plurality of nucleotides, wherein: each polynucleotide of the first plurality of DNA polynucleotides encodes for a promoter and terminator, wherein each promoter and terminator is distinct from the promoter and terminator of other polynucleotides of the first plurality of nucleotides; and each polynucleotide of the second plurality of nucleotides includes a coding sequence, a first flanking region on the 5′ side of the polynucleotide and a second flanking region on the 3′ side of the polynucleotide; and introducing the polynucleotides into a host cell that includes machinery for homologous recombination, wherein the host cell assembles the expression vector via homologous recombination that occurs in the flanking regions of the second plurality of polynucleotides; wherein the expression vector is configured to facilitate simultaneous production of a plurality of proteins encoded by the second plurality of nucleotides. In some cases, the first flanking region and the second flanking region are each between 15 and 75 base pairs in length. In some cases, the first flanking region and the second flanking region are each between 40 and 60 base pairs in length.

In some embodiments, the present disclosure provides a method for identifying a gene cluster capable of producing a small molecule for modulating a first target protein, the method comprising: selecting, from a database comprising a list of biosynthetic gene clusters, one or more gene clusters that include or are positioned proximal to a region that encodes a protein that is identical with or homologous to the first target protein.
In some cases, the one or more gene clusters are selected from the group consisting of (1) clusters that comprise one or more polyketide synthases, (2) clusters that comprise one or more non-ribosomal peptide synthetases, (3) clusters that comprise one or more terpene synthases, (4) clusters that comprise one or more UbiA-type terpene cyclases, and (5) clusters that comprise one or more dimethylallyl transferases. In some cases, the protein that is encoded by the region that is included in or positioned proximal to the biosynthetic gene cluster is identical to or has greater than 30% homology to the first target protein. In some cases, the region that encodes the protein that is identical with or homologous to the first target protein is within 20,000 base pairs of a region of a portion of the gene cluster that encodes a polyketide synthase, a non-ribosomal peptide synthetase, a terpene synthetase, a UbiA-type terpene cyclase, or a dimethylallyl transferase. In some cases, the first target protein is BRSK1. In some cases, selecting the one or more gene clusters comprises operating a computer, wherein operation of the computer comprises running an algorithm that takes into account both an input sequence for the first target protein and sequence information from a database that includes sequence information from a plurality of species such that the computer returns information corresponding to one or more gene clusters. In some cases, the algorithm takes into account the phylogenetic relationship between gene clusters in the database.
In some cases, the one or more gene clusters include a coding sequence for a protein that is an extracellular protein, a membrane-tethered protein, a protein involved in a transport or secretion pathway, a protein homologous to a protein involved in a transport or secretion pathway, a protein with a peptide targeting signal, a protein with a terminal sequence with homology to a targeting signal, an enzyme that degrades small molecules, or a protein with homology to an enzyme that degrades small molecules.

In some embodiments, the present disclosure provides a method for producing a compound that modulates the first target protein, the method comprising: identifying a gene cluster (e.g., via methods disclosed herein), expressing the gene cluster or a plurality of genes from the gene cluster in a host cell, and isolating a compound produced by the gene cluster. In some cases, the method further comprises screening the isolated compound for modulation of an activity of the first target protein. In some cases, the cluster-encoded protein that is homologous to the first target protein is resistant to modulation by the isolated compound when compared to modulation of the first target protein. In some cases, the compound is not toxic to the species from which the cluster originates due to one or more of (1) sequence differences between the target protein and the cluster-encoded protein, (2) spatial separation of the compound from the cluster-encoded protein, and (3) high expression levels for the cluster-encoded protein.
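
The proximity and homology criteria above can be illustrated with a minimal sketch. It assumes that gene and cluster coordinates lie on the same contig, that an identity value against the target protein has already been computed by some external homology search, and that distance is measured to the cluster boundary rather than to a particular core biosynthetic gene. The class and function names are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass

MAX_DISTANCE_BP = 20_000  # proximity window stated in the text
MIN_IDENTITY = 0.30       # "greater than 30% homology" threshold from the text


@dataclass
class Gene:
    name: str
    start: int
    end: int
    identity_to_target: float  # assumed to be precomputed elsewhere (0..1)


@dataclass
class Cluster:
    cluster_id: str
    start: int
    end: int


def distance_to_cluster(gene: Gene, cluster: Cluster) -> int:
    """Return 0 if the gene overlaps the cluster, otherwise the gap in bp."""
    if gene.end < cluster.start:
        return cluster.start - gene.end
    if gene.start > cluster.end:
        return gene.start - cluster.end
    return 0


def clusters_with_target_homolog(clusters, genes):
    """List (cluster_id, gene_name) pairs that satisfy both criteria."""
    hits = []
    for cluster in clusters:
        for gene in genes:
            close = distance_to_cluster(gene, cluster) <= MAX_DISTANCE_BP
            homologous = gene.identity_to_target > MIN_IDENTITY
            if close and homologous:
                hits.append((cluster.cluster_id, gene.name))
    return hits
```

In this sketch a gene located inside the cluster has a distance of zero, so the "includes" and "is within 20 kilobases of" cases are handled by the same test.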

In some embodiments, the present disclosure provides a method for making a DNA vector, the method comprising: identifying a gene cluster comprising a plurality of genes that are capable of producing a small molecule for modulating a first target protein; introducing two or more genes of the plurality of genes into a vector, wherein the vector is configured to facilitate expression of the two or more genes in a host organism; wherein (1) the gene cluster encodes one or more proteins selected from the group consisting of a polyketide synthase, a non-ribosomal peptide synthetase, a terpene synthetase, a UbiA-type terpene cyclase, and a dimethylallyl transferase and (2) the gene cluster includes or is positioned proximal to a region that encodes a protein that is identical with or homologous to the first target protein. In some cases, the DNA vector is a circular plasmid. In some cases, the DNA vector comprises a plurality of promoters, wherein each promoter of the plurality of promoters is configured to, when the vector is introduced into a Saccharomyces cerevisiae cell, promote a lower level of heterologous expression when the cell exhibits predominantly anaerobic energy metabolism than when the cell exhibits aerobic energy metabolism. In some cases, each promoter of the plurality of promoters differs in sequence from one another. In some cases, each promoter of the plurality of promoters has a sequence selected from the group consisting of: SEQ ID NOs: 1-66. In some cases, each promoter of the plurality of promoters has a sequence selected from the group consisting of: SEQ ID NOs: 20-35, and SEQ ID NOs: 41-50. In some cases, when the Saccharomyces cerevisiae cell is exhibiting anaerobic energy metabolism, the cell is catabolizing a fermentable carbon source selected from glucose or dextrose; and when the Saccharomyces cerevisiae cell is exhibiting aerobic energy metabolism, the cell is catabolizing a non-fermentable carbon source selected from ethanol or glycerol.

In some embodiments, the present disclosure provides a method for the heterologous expression of a plurality of genes in a yeast strain, the method comprising: obtaining a yeast strain that includes a vector for expressing a plurality of genes from a single gene cluster of a non-yeast organism; and inducing expression of the plurality of genes; wherein the gene cluster in the non-yeast organism includes or is positioned proximal to a region that encodes a protein that is at least 30% homologous to a target protein. In some cases, the method further comprises introducing the plurality of genes from the single gene cluster into the vector. In some cases, the method further comprises introducing the vector into the yeast strain. In some cases, expression of the plurality of genes results in the formation of a small molecule, wherein the small molecule modulates the activity of the target protein. In some cases, the gene cluster is a gene cluster of a non-yeast fungus. In some cases, the yeast strain is from Saccharomyces cerevisiae. In some cases, the target protein is a human protein.

In some embodiments, the present disclosure provides a system for identifying a gene cluster for introduction of a plurality of genes from the gene cluster into a host organism, the system comprising: a processor; a non-transitory computer-readable medium comprising instructions that, when executed by the processor, cause the processor to perform operations, the operations comprising: loading the identity or sequence of a first target protein into memory; loading the identity or sequence of a plurality of biosynthetic gene clusters into memory; identifying, from the plurality of biosynthetic gene clusters, one or more gene clusters that encode or are positioned proximal to a region that encodes a protein that is identical with or homologous to the first target protein; and scoring the one or more gene clusters based on the likelihood of each gene cluster being capable of use to produce a small molecule that modulates the first target protein. In some cases, scoring the one or more gene clusters comprises comparing the sequence of the first target protein (or a DNA sequence encoding the first target protein) to a sequence of a protein encoded in or proximal to the gene cluster (or to a DNA sequence encoding the protein that is in or proximal to the gene cluster).

In some embodiments, the present disclosure provides a system for identifying one or more biosynthetic gene clusters for introduction into a host organism to produce one or more compounds that modulate a specific target protein, the system comprising: a processor; a memory containing a gene cluster identification application; wherein the gene cluster identification application directs the processor to: load data describing at least one target protein into the memory; load data describing a plurality of biosynthetic gene clusters into the memory; score each of the plurality of biosynthetic gene clusters based upon: performing a homolog search for each biosynthetic gene cluster to determine a presence of at least one homolog of a target protein within or adjacent the biosynthetic gene cluster; confidence of homology of the at least one target protein to at least one gene in a biosynthetic gene cluster; a fraction of a homologous gene that meets an identity threshold; a total number of genes homologous to the at least one target protein present in the entire genome of an organism; homology of the at least one homolog of at least one target protein within or adjacent the biosynthetic gene cluster to genes in the target protein's genome; phylogenetic relationship of the at least one target protein to a gene in a cluster; expected number of homologs of the at least one target protein in or adjacent to a biosynthetic cluster; and a likelihood that at least one target protein is essential for a cellular process in the natural environment; and output a report identifying one or more biosynthetic gene clusters that are most likely to produce a compound that modulates the at least one target protein.
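
One way such a multi-factor score might be combined is sketched below. The disclosure lists the factors but does not state how they are weighted or normalized, so the weighted sum, the feature names, and the requirement that each feature be pre-scaled to the range 0 to 1 are all illustrative assumptions.

```python
def score_cluster(features, weights):
    """Weighted sum of pre-normalized features (each assumed to be in 0..1).

    A cluster with no detected homolog of the target protein scores zero,
    reflecting the homolog search that precedes the other factors.
    """
    if not features.get("homolog_present", 0.0):
        return 0.0
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


def rank_clusters(feature_table, weights, top_n=3):
    """Return the top_n (cluster_id, score) pairs, highest score first."""
    scored = [(cid, score_cluster(f, weights)) for cid, f in feature_table.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical weights; the source does not specify any values.
EXAMPLE_WEIGHTS = {
    "homology_confidence": 0.5,
    "identity_fraction": 0.25,
    "essentiality": 0.25,
}
```

The ranked list returned by `rank_clusters` plays the role of the output report; additional listed factors, such as genome-wide homolog counts or phylogenetic relationship, could be added as further normalized features under the same scheme.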

In some embodiments, the present disclosure provides a method for selecting a biosynthetic gene cluster that produces a secondary metabolite, the method comprising: obtaining a list of gene clusters; performing a phylogenetic analysis of the genes within the clusters compared to known genes from known biosynthetic gene clusters; and selecting the biosynthetic gene cluster based on its phylogenetic relationship with the known genes. In some cases, the biosynthetic gene cluster with the most distant phylogenetic relationship from the known genes is selected.
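
The selection of the most distantly related cluster can be sketched as follows, assuming that phylogenetic distances between each candidate cluster and each known gene have already been computed by a separate analysis. Scoring each candidate by its distance to the nearest known gene is one possible reading of "most distant" and is an assumption, as are the function and identifier names.

```python
def most_distant_cluster(distances):
    """Pick the candidate whose nearest known gene is farthest away.

    distances maps candidate_id -> {known_gene_id: phylogenetic distance}.
    Candidates with no distance entries are skipped.
    """
    nearest = {cid: min(d.values()) for cid, d in distances.items() if d}
    if not nearest:
        raise ValueError("no candidate clusters with distance data")
    return max(nearest, key=nearest.get)
```

Using the nearest known gene as the reference keeps a candidate from being selected merely because it is far from most known genes while being nearly identical to one of them.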

In some embodiments, the present disclosure provides a method for identifying a gene cluster that produces a compound that binds a protein of interest, the method comprising: obtaining sequence information for a plurality of contiguous sequences, wherein each contiguous sequence includes a biosynthetic gene cluster and flanking genomic sequences; analyzing the contiguous sequences for the presence of a gene that encodes a protein with homology to the protein of interest; and selecting a biosynthetic gene cluster which includes, or is proximal to, a gene that encodes a protein that is homologous to the protein of interest. In some cases, the contiguous nucleotide sequence is less than 40,000 base pairs in length.

In some embodiments, the present disclosure provides a modified yeast cell having a BY background, wherein relative to unmodified BY4741 and BY4742, the modified yeast cell has both (1) increased sporulation frequency and (2) increased mitochondrial stability. In some cases, the modified yeast cell grows faster on non-fermentable carbon sources than unmodified BY4741 and BY4742. In some cases, the yeast cell comprises one or more of the following genotypes: MKT1(30G), RME1(INS-308A), and TAO3(1493Q). In some cases, the yeast cell comprises one or more of the following genotypes: CAT5(91M), MIP1(661T), SAL1+, and HAP1+.

In some embodiments, the present disclosure provides a method of forming an expression vector, the method comprising: combining a first plurality of DNA polynucleotides with a second plurality of polynucleotides, wherein: each polynucleotide of the first plurality of DNA polynucleotides encodes for a promoter and terminator, wherein each promoter and terminator is distinct from the promoter and terminator of other polynucleotides of the first plurality of nucleotides; and each polynucleotide of the second plurality of nucleotides includes a coding sequence, a first flanking region on the 5′ side of the polynucleotide and a second flanking region on the 3′ side of the polynucleotide, wherein each flanking region is between 15 and 75 base pairs in length; and introducing the polynucleotides into a host cell that includes machinery for homologous recombination, wherein the host cell assembles the expression vector via homologous recombination that occurs in the flanking regions of the second plurality of polynucleotides; wherein the expression vector is configured to facilitate simultaneous production of a plurality of proteins encoded by the second plurality of nucleotides. In some cases, the host cell is a yeast cell.
In some cases, each flanking region is between 40 and 60 base pairs in length. In some cases, at least one polynucleotide of the first plurality of nucleotides encodes a selection marker.

In some embodiments, the present disclosure provides a system for generating a synthetic gene cluster via homologous recombination, the system comprising 1 through N unique promoter sequences, 1 through N unique terminator sequences, and 1 through N unique coding sequences, wherein: coding sequence 1 is attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical to the last 30-70 base pairs of promoter 1 and a second end portion is identical to the first 30-70 base pairs of terminator 1; coding sequence 2 is attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical to the last 30-70 base pairs of promoter 2 and a second end portion is identical to the first 30-70 base pairs of terminator 2; and coding sequence N is attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical to the last 30-70 base pairs of promoter N and a second end portion is identical to the first 30-70 base pairs of terminator N. In some cases, terminator 1 and promoter 2 are portions of the same double-stranded oligonucleotide.
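
The end-portion design described above can be illustrated with a short sketch that builds each coding fragment in silico. It assumes all sequences are given 5′ to 3′ on the same strand as plain strings; the function names and the default arm length of 50 base pairs are hypothetical choices within the stated 30-70 base pair range.

```python
ARM_MIN, ARM_MAX = 30, 70  # homology arm length range stated in the text


def add_homology_arms(promoter, coding_seq, terminator, arm_len=50):
    """Flank a coding sequence with the promoter tail and terminator head."""
    if not ARM_MIN <= arm_len <= ARM_MAX:
        raise ValueError("arm_len must be between 30 and 70 bp")
    if len(promoter) < arm_len or len(terminator) < arm_len:
        raise ValueError("promoter and terminator must be at least arm_len long")
    # First end portion: last arm_len bp of the promoter.
    # Second end portion: first arm_len bp of the terminator.
    return promoter[-arm_len:] + coding_seq + terminator[:arm_len]


def build_fragments(promoters, coding_seqs, terminators, arm_len=50):
    """Pair promoter k, coding sequence k, and terminator k for k = 1..N."""
    if not (len(promoters) == len(coding_seqs) == len(terminators)):
        raise ValueError("need the same number N of each part")
    if len(set(promoters)) != len(promoters) or len(set(terminators)) != len(terminators):
        raise ValueError("promoters and terminators must each be unique")
    return [
        add_homology_arms(p, c, t, arm_len)
        for p, c, t in zip(promoters, coding_seqs, terminators)
    ]
```

Requiring unique promoters and terminators in this sketch mirrors the text: because each coding fragment shares its end portions with exactly one promoter and one terminator, recombination is directed to a single intended position in the assembled cluster.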

In some embodiments, the present disclosure provides a method for assembling a synthetic gene cluster, the method comprising: obtaining 1 through N unique promoters, 1 through N unique terminators, and 1 through N unique coding sequences, wherein: coding sequence 1 is attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical to the last 30-70 base pairs of promoter 1 and a second end portion is identical to the first 30-70 base pairs of terminator 1; coding sequence 2 is attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical to the last 30-70 base pairs of promoter 2 and a second end portion is identical to the first 30-70 base pairs of terminator 2; and coding sequence N is attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical to the last 30-70 base pairs of promoter N and a second end portion is identical to the first 30-70 base pairs of terminator N; transforming the 1 through N promoters, terminators and coding sequences into a yeast cell; and isolating a plasmid containing the 1 through N promoters, terminators and coding sequences from the yeast cell. In some cases, the method further comprises a coding sequence for a selection marker. In some cases, the selection marker is an auxotrophic marker. In some cases, the yeast cell has a deficiency in a DNA ligase gene.

In some embodiments, the present disclosure provides a yeast strain which allows for both (1) homologous DNA assembly and (2) production of heterologous genes in the same strain. In some cases, the yeast strain is a DHY strain. In some cases, the strain allows DNA assembly via homologous recombination with an efficiency of at least 80% as compared to DNA assembly in BY. In some cases, production of heterologous compounds in the strain is accomplished with an efficiency of at least 80% as compared to heterologous compound production in BJ5464. In some cases, the strain allows production of heterologous proteins with an efficiency of at least 80% as compared to heterologous protein production in BJ5464.

In some embodiments, the present disclosure provides a method for isolating a plasmid from a yeast cell, the method comprising: isolating total DNA from a yeast cell that includes a plasmid; incubating the DNA with an exonuclease such that the exonuclease degrades substantially all of the linear DNA in the isolated total DNA from the yeast cell; optionally inactivating the exonuclease; and recovering the plasmid DNA. In some cases, the isolated plasmid DNA is of sufficient purity for use in a sequencing reaction. In some cases, the plasmid DNA is further prepared for a sequencing reaction.

In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 6 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 7 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 8 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 9 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 10 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 11 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 12 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 13 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 14 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 15 and a pharmaceutically acceptable excipient. In some embodiments, the present disclosure provides a pharmaceutical composition comprising Compound 16 and a pharmaceutically acceptable excipient.
In some embodiments, the present disclosure provides a method of producing Compound 3, the method comprising: providing a vector or vectors comprising the coding sequences of SEQ ID NOs: 200-206; transforming a host cell with the vector or vectors; incubating the host cell in culture media under conditions suitable for the expression of the coding sequences; and isolating the compound produced by the host cell.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

The written disclosure herein describes illustrative embodiments that are non-limiting and non-exhaustive. Reference is made to certain of such illustrative embodiments that are depicted in the figures, in which:

FIG. 1A illustrates strategies that may be used to obtain a secondary metabolite from a fungal strain given different properties of the fungal strain.

FIG. 1B illustrates examples of characterized gene clusters which produce known chemicals and a novel gene cluster with a product.

FIG. 2 illustrates a phylogenetic analysis of enzymes that produce secondary metabolites.

FIG. 3 illustrates self-resistance mechanisms for potentially toxic secondary metabolites.

FIG. 4 illustrates a gene cluster that produces lovastatin.

FIG. 5A illustrates an exemplary process to extract and utilize compound products in accordance with an embodiment of the invention.

FIG. 5B illustrates an exemplary process to produce and extract heterologous, biosynthetic compound products in accordance with an embodiment of the invention.

FIG. 5C illustrates an example production pipeline for producing secondary metabolites from gene clusters.

FIG. 5D illustrates an example workflow for producing secondary metabolites from gene clusters.

FIG. 6A illustrates a yeast phase chart displaying yeast cell concentration in relation to time to provide reference for various embodiments of the disclosure.

FIG. 6B illustrates a yeast phase chart displaying glucose or dextrose concentration in relation to time to provide reference for various embodiments of the disclosure.

FIG. 6C illustrates a yeast phase chart displaying ethanol or glycerol concentration in relation to time to provide reference for various embodiments of the disclosure.

FIG. 7A illustrates a DNA vector having a production-phase promoter in accordance with an embodiment of the disclosure.

FIG. 7B illustrates a DNA vector having multiple production-phase promoters in accordance with an embodiment of the disclosure.

FIG. 8A illustrates a DNA expression vector having a production-phase promoter within an expression cassette in accordance with an embodiment of the disclosure.

FIG. 8B illustrates a DNA expression vector having multiple production-phase promoters, each within an expression cassette in accordance with an embodiment of the disclosure.

FIG. 9 illustrates a method to construct and utilize production-phase promoter DNA vectors in accordance with various embodiments of the disclosure.

FIG. 10A illustrates an overview of an approach involving yeast homologous recombination assembly.

FIG. 10B illustrates the homologous recombination in yeast of the parts from FIG. 10A.

FIG. 10C illustrates the plasmid which results from the parts of FIG. 10A, homologously recombined as in FIG. 10B.

FIG. 10D illustrates improved sequencing results obtained via disclosed methods.

FIG. 10E illustrates assembly of plasmid DNA from up to 14 individual fragments.

FIG. 11A illustrates improved assembly efficiency in a background lacking the DNL4 ligase.

FIG. 11B illustrates equivalent sequencing efficiencies using DNA prepared from both colonies (red) and liquid cultures (blue) for four test assemblies.

FIG. 11C illustrates sequencing efficiencies observed using both standard and modified NexteraXT library preparation methods.

FIG. 11D illustrates a workflow for sequencing plasmids from yeast via a step of transforming the plasmids into E. coli.

FIG. 11E illustrates a workflow for sequencing plasmids from yeast in accordance with methods described herein.

FIG. 12 illustrates repaired SNPs in yeast strain DHY674 relative to BY4741.

FIG. 13 illustrates DHY213, BJ5464, and X303 (a W303 derivative) grown on glucose (fermentation) and ethanol/glycerol (respiration) media.

FIG. 14A illustrates relative growth rates of strains described here in YPD culture. The dotted line denotes the diauxic shift (the point at which the culture exhausts all glucose and transitions from fermentation to respiration). The DHY-derived strain JHY692 shows significantly improved growth in the respiration phase of the culture.

FIG. 14B illustrates expression of eGFP driven by the PADH2 promoter in the strains from FIG. 14A.

FIG. 14C illustrates expression of eGFP driven by the PPCK1 promoter in the strains from FIG. 14A.

FIG. 15A illustrates an example gene cluster.

FIG. 15B illustrates a polyketide produced by a 5-gene gene cluster.

FIG. 16 is a heat map graphic generated in accordance with various embodiments of the disclosure with data of expression of enhanced-green fluorescent protein driven by various S. cerevisiae promoters.

FIG. 17 is a data graph of enhanced-green fluorescent protein expression driven by various S. cerevisiae promoters.

FIG. 18 illustrates fluorescence intensity of 10⁵ cells expressing enhanced-green fluorescent protein driven by various promoters.

FIG. 19 illustrates a phylogenetic tree of Saccharomyces sensu stricto subgenus.

FIG. 20 illustrates a multiple sequence alignment of various Saccharomyces sensu stricto species' upstream activating sequences in ADH2 promoters.

FIG. 21 illustrates homology between various Saccharomyces sensu stricto species' ADH2 promoters.

FIG. 22 is a heat map graphic generated in accordance with various embodiments of the disclosure with data of expression of enhanced-green fluorescent protein driven by various Saccharomyces sensu stricto ADH2 promoters.

FIG. 23 is a data graph of enhanced-green fluorescent protein expression driven by various Saccharomyces sensu stricto ADH2 promoters.

FIG. 24 illustrates four multi-gene expression vector constructs and a data graph of the resultant compound production, in accordance with an embodiment of the invention.

FIG. 25 illustrates a biosynthetic process that produces the compound emindole SB via a fungal four-gene cluster.

FIG. 26 is a data graph of the production results of two product compounds generated.

FIG. 27A illustrates two plasmid vector constructs in accordance with an embodiment of the disclosure.

FIG. 27B illustrates a further vector construct in a yeast cell in accordance with an embodiment of the disclosure.

FIG. 28A illustrates a phylogenetic analysis of further gene clusters. Abbreviations used may include Adenylation domain (A), a,b-hydrolase (a,b-h), ATP-binding cassette transporter (ABC), Acyl carrier protein (ACP), Alcohol dehydrogenase (ADH), Aldo-keto reductase (AK-red), Aminooxidase (AmOx), Aminotransferase (AmT), Arylsulfotransferase (ArST), Acyltransferase domain (AT), C-methyltransferase (cMT), Terpene cyclase (Cyc), Dehydratase (DH), Domain of unknown function 4246 (DUF4246), Flavin adenine dinucleotide (FAD) binding protein (FAD), Iron dependent alcohol dehydrogenase (Fe-ADH), Flavin-dependent monooxygenase (FMO), Geranyl-geranyl pyrophosphate synthase (GGPPS), Glycosidase (glycos.), Glucose-methanol-choline oxidoreductase (GMC), Halogenase (Halo), Highly reducing polyketide synthase (HR-PKS), Hypothetical protein (Hyp), Indole-3-acetic acid-amido synthetase (IAS), Ketosynthase domain (KS), metallo-β-lactamase (mBla), Mitochondrial phosphate carrier protein (MCP), Major facilitator superfamily transporter (MFS), Metallohydrolase (MH), Methyl transferase (MT), Nicotinamide adenine dinucleotide dependent dehydrogenase (NAD-DH), Nicotinamide adenine dinucleotide phosphate (NADP) dependent reductase (NADP-R), N-methyltransferase (nMT), Non reducing polyketide synthase (NR-PKS), O-succinylhomoserine sulfhydrylase (O-suc-SH), O-methyltransferase (oMT), Oxidoreductase (OxR), Cytochrome p450 (p450), Dipeptidyl peptidase (Pep), Prenyl transferase (PrT), Product template domain (PT), Riboflavin biosynthesis protein RibD (RibD), RNA helicase (RNAh), starter unit:ACP transacylase domain (SAT), Short-chain dehydrogenase (SDH), Short-chain dehydrogenase/reductase (SDR), Serine hydrolase (SH), Sugar transport protein (ST), Thiolation domain (T), Terminal domain (TD), Thioesterase domain (TE), Transcription factor (TF), and UbiA-type terpene cyclase (UTC).

FIG. 28B illustrates various gene clusters and biosynthetic compound products in accordance with various embodiments of the invention.

FIG. 28C illustrates various gene clusters and biosynthetic compound products in accordance with various embodiments of the invention.

FIG. 28D illustrates various gene clusters and biosynthetic compound products in accordance with various embodiments of the invention.

FIG. 28E illustrates various gene clusters and biosynthetic compound products in accordance with various embodiments of the invention.

FIG. 28F illustrates various gene clusters and biosynthetic compound products in accordance with various embodiments of the invention.

FIG. 29A illustrates a phylogenetic analysis of gene clusters.

FIG. 29B illustrates correction of a gene cluster.

FIG. 30A illustrates schematics of PKS enzyme containing BGCs examined herein.

FIG. 30B illustrates schematics of UTC containing BGCs examined herein.

FIG. 31 illustrates a volcano plot of all spectral features identified in the automated analysis of strains expressing PKS containing BGCs. All features determined to be specific to the BGC expressing strain were identified by comparison to a negative vector control.

FIG. 32 illustrates features produced by BGC PKS1.

FIG. 33 illustrates features produced by BGC PKS2.

FIG. 34 illustrates features produced by BGC PKS4.

FIG. 35 illustrates features produced by BGC PKS6.

FIG. 36 illustrates features produced by BGC PKS8.

FIG. 37 illustrates features produced by BGC PKS10.

FIG. 38 illustrates features produced by BGC PKS13.

FIG. 39 illustrates features produced by BGC PKS14.

FIG. 40 illustrates features produced by BGC PKS15.

FIG. 41 illustrates features produced by BGC PKS16.

FIG. 42 illustrates features produced by BGC PKS17.

FIG. 43 illustrates features produced by BGC PKS18.

FIG. 44 illustrates features produced by BGC PKS20.

FIG. 45 illustrates features produced by BGC PKS22.

FIG. 46 illustrates features produced by BGC PKS23.

FIG. 47 illustrates features produced by BGC PKS24.

FIG. 48 illustrates features produced by BGC PKS28.

FIG. 49 illustrates NMR data and structure of Compound 6.

FIG. 50 illustrates NMR data and structure of Compound 7.

FIG. 51 illustrates NMR data and structure of Compound 8.

FIG. 52 illustrates NMR data and structure of Compound 9.

FIG. 53 illustrates NMR data and structure of Compound 10.

FIG. 54 illustrates NMR data and structure of Compound 11.

FIG. 55 illustrates NMR data and structure of Compound 12.

FIG. 56 illustrates NMR data and structure of Compound 13.

FIG. 57 illustrates NMR data and structure of Compound 14.

FIG. 58 illustrates NMR data and structure of Compound 15.

FIG. 59 illustrates NMR data and structure of Compound 16.

DETAILED DESCRIPTION

Systems and methods in accordance with various embodiments of the disclosure identify, refactor, and express biosynthetic gene clusters in host organisms to produce secondary metabolites. Systems and methods in accordance with various embodiments of the disclosure utilize host organisms to produce secondary metabolites. Using host organisms allows for production of secondary metabolites from biosynthetic gene clusters regardless of whether the native host cell can be cultured, or whether the cluster is expressed in the native host (see FIG. 1A). In some cases, the secondary metabolites may bind to one or more specific proteins. In some embodiments, the host organism ordinarily does not produce the secondary metabolite. The host organism obtains the ability to produce the secondary metabolite due to the introduction of a biosynthetic gene cluster identified using a cluster identification process performed in accordance with an embodiment of the disclosure. In one embodiment, a cluster identification process identifies a biosynthetic gene cluster that possesses characteristics suggesting that it is responsible for producing a secondary metabolite with novel chemistry. In another embodiment, a cluster identification process (discussed further below) identifies a biosynthetic gene cluster that possesses characteristics suggesting that it is responsible for producing a secondary metabolite that binds to a specific protein of interest. In some embodiments, this disclosure provides the inclusion of a biosynthetic gene cluster identified using a cluster identification process, in accordance with various embodiments of the disclosure, within a host organism, enabling the host organism to express a secondary metabolite. In some cases, the secondary metabolite produced in the host cell may be identical to a secondary metabolite that is naturally produced by the organism in which the biosynthetic gene cluster was originally identified. In some cases, the secondary metabolite produced by the host cell is an analog of a secondary metabolite produced by the organism from which the cluster was identified. In other cases, the secondary metabolite produced in the host cell may be structurally distinct from the secondary metabolite produced by the cluster in the originating species. Differences in the product produced may arise from differences in expression and localization of the coding sequences from the cluster. Additionally, coding sequences, or expression products thereof, from the cluster may interact with other coding sequences, or expression products thereof, not contained in the cluster. Host cell produced secondary metabolites which are distinct from those produced in the originating species of the cluster may be termed non-naturally occurring secondary metabolites or non-natural secondary metabolites. The secondary metabolite produced by the host organism can be isolated and then can be used, for example, in a treatment. In some cases, the secondary metabolites may be used in a treatment of a disease or disorder which involves aberrant activity of a specific protein. This disclosure also describes a panel of sesquiterpenoid and polyketide products.

Cluster Identification

Cluster identification processes, in accordance with many embodiments of the disclosure, utilize specific properties of a biosynthetic gene cluster to identify gene clusters of interest from sequence data. Sequence data used with the methods of this disclosure may comprise genomic sequence data, transcriptome sequence data, or other sequence data. In some cases, sequence data may be generated by sequencing a DNA sample obtained from an environmental sample. In other cases, sequence data may be obtained from publicly available genome sequence libraries, or may be purchased. Genome sequences may be derived from any organism. In some cases, genome sequences may be derived from a fungal, bacterial, archaeal or plant species.

In some cases, the genome sequences may be derived from fungi. In some cases, the genome sequences may be derived from fungi which are poorly characterized or difficult to culture. In some cases, the genome sequences may be derived from fungi which are well characterized, or partially characterized. Examples of types of fungi from which sequences may be derived include fungi from one of the following phyla: Basidiomycota, Ascomycota, Neocallimastigomycota, Blastocladiomycota, Glomeromycota, Chytridiomycota and Microsporidia. Examples of fungal species which may be the source of sequence data include, but are not limited to: Aspergillus tubingensis, Hypomyces subiculosus, Coniothyrium sporulosum, Acremonium sp. KY4917, Aspergillus niger, Thielavia terrestris, Trichoderma virens, Pseudogymnoascus pannorum, Scedosporium apiospermum, Metarhizium anisopliae, Cochliobolus heterostrophus, Verruconis gallopava, Moniliophthora roreri, Punctularia strigosozonata, Hydnomerulius pinastri, Arthroderma gypseum, Setosphaeria turcica, Pyrenophora teres, Cladophialophora yegresii, Talaromyces cellulolyticus, Endocarpon pusillum, Hypholoma sublateritium, Ceriporiopsis subvermispora, Botryotinia cinerea, Fomitiporia mediterranea, Heterobasidion annosum, Gelatoporia subvermispora, Dichomitus squalens, Pleurotus ostreatus, Schizophyllum commune, Stereum hirsutum, and Dacryopinax primogenitus.

Provided in FIG. 1B is a flow chart showing a process embodiment that can be implemented using computer systems for identifying and ranking biosynthetic gene clusters (BGCs) that are likely to produce secondary metabolites suitable for anthropogenic use (e.g., medicinal). As shown, Process 1000 can begin by obtaining genetic sequence data from a biological source or sequence database (1001). In some cases, the sequence data may be derived from single cell sequencing of a fungal cell of unknown species. For example, an environmental sample may contain many different cells which may be separated and sequenced. In some cases, the sequence data is obtained from a publicly available genetic data library, or may be purchased. Often the genetic sequence data is a genomic sequence of an organism; however, any genetic sequence data, including partial genome sequence data, may be used.

Process 1000 also identifies biosynthetic gene clusters (1003). Clusters may be identified using bioinformatics methods to scan genome sequences. Characteristics of a gene cluster may include a grouping of two or more genes within about 5 kb, 10 kb, 15 kb, 20 kb, 25 kb, 30 kb, 35 kb, 40 kb or 45 kb of each other. Genes may be bioinformatically identified by the presence of known promoter sequences, transcription initiation sequences, or homology to known genes or gene features. The term homology as used herein refers to sequences with high sequence identity, for example a sequence identity of at least about 50%, 60%, 70%, 80%, 90%, 95%, 97%, 98%, 99% or more than 99%. Sequence identity may be determined using alignment tools such as Basic Local Alignment Search Tool (BLAST), available via the National Center for Biotechnology Information (NCBI), to determine areas of conserved sequence. Biosynthetic gene clusters may be identified using a bioinformatics tool, such as ClustScan, SMURF, CLUSEAN, and/or antiSMASH.
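
The distance criterion described above can be illustrated with a short sketch. This is a minimal illustration only and is not the method used by ClustScan, SMURF, CLUSEAN, or antiSMASH; the `Gene` record, the function name `group_by_distance`, and the 10 kb default gap are assumptions introduced here for illustration.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are in base pairs."""
    name: str
    contig: str
    start: int
    end: int


def group_by_distance(genes: List[Gene], max_gap_bp: int = 10_000) -> List[List[Gene]]:
    """Group genes on the same contig that lie within max_gap_bp of the group so far."""
    groups: List[List[Gene]] = []
    group_end = 0  # right-most coordinate covered by the current group
    for gene in sorted(genes, key=lambda g: (g.contig, g.start)):
        same_contig = bool(groups) and groups[-1][-1].contig == gene.contig
        if same_contig and gene.start - group_end <= max_gap_bp:
            groups[-1].append(gene)
            group_end = max(group_end, gene.end)
        else:
            groups.append([gene])
            group_end = gene.end
    # The text describes a cluster as a grouping of two or more genes.
    return [group for group in groups if len(group) >= 2]


example = [
    Gene("a", "contig1", 0, 1_000),
    Gene("b", "contig1", 5_000, 6_000),
    Gene("c", "contig1", 40_000, 41_000),
    Gene("d", "contig2", 100, 900),
]
candidate_groups = group_by_distance(example)
```

In this sketch, genes on different contigs are never grouped, and singletons are discarded. A dedicated tool would additionally use gene function to decide which groupings are biosynthetic.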

Biosynthetic gene clusters typically contain one or more core biosynthetic genes and one or more tailoring genes. Clusters which contain the same, or highly similar, core enzymes with different tailoring genes may produce quite different compounds, as seen in FIG. 1C. Expressing a subset of genes from a cluster may also result in production of a different compound compared with the compound produced by expression of all the genes in the cluster.

Process 1000 also scores and/or ranks gene clusters utilizing various factors (1005). In some cases, scores and rankings are based on the type of secondary metabolite to be produced. In other cases, scores and rankings are based on a protein or domain existing within the cluster. In still other cases, the level of homology of a particular protein within the cluster to a protein of interest is considered. It should be understood that many different factors can be used, as determined by the application and use of the biosynthetic gene cluster data.

An embodiment of a process for identifying biosynthetic gene clusters using computer systems is provided in FIG. 1D. Process 2000 can begin with obtaining genetic sequence data of an organism having gene clusters (2001). In many cases, the genetic sequence data is a genomic sequence of an organism. In other cases, the genetic sequence data is partial genome sequence data. In addition to genetic sequence data, Process 2000 also obtains target sequence data keying to biosynthetic gene clusters of interest (2003). The target sequence is any sequence the user wishes to use to define and identify clusters. In some cases, the target sequence is a particular protein domain of interest. In other cases, the target sequence is a particular protein, protein homolog or protein class.

In some embodiments, clusters are scanned for the presence of genes encoding proteins known to be involved in biosynthetic pathways. Key proteins involved in biosynthetic pathways include terpene synthases, polyketide synthases (PKSs, both highly-reducing and non-reducing), non-ribosomal peptide synthetases, UbiA-type terpene cyclases (UTCs), polyketide synthase non-ribosomal peptide synthetase hybrids, and dimethyl allyl transferases (see FIG. 2).

Process 2000 also identifies the target sequence within genetic sequence data using homologous alignment scores (2005). Using an appropriate application, the sequence of the target is aligned to the genetic sequence data, looking for a threshold of homology. In some cases, a positive homologous event occurs when the target sequence aligns with a portion of the genetic sequence having homology of at least about 50%, 60%, 70%, 80%, 90%, 95%, 97%, 98%, 99% or more than 99%. Sequence homology may be determined using any alignment tool, such as, for example, BLAST.
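
As an illustration of applying such a threshold, the following sketch computes percent identity from two sequences that have already been aligned by some external tool. The function names and the choice of denominator (all alignment columns, counting gap columns) are assumptions for illustration; alignment tools differ in how they report identity.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Inputs are two rows of a pairwise alignment of equal length, with '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns that are gaps in both rows.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def is_homologous_event(aligned_a: str, aligned_b: str, threshold_pct: float = 50.0) -> bool:
    """True when the aligned pair meets the chosen identity threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold_pct
```

Under these assumptions, an alignment of `ACGT-A` against `ACCTTA` has four matching columns out of six, which clears a 50% threshold but not a 70% threshold.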

Using the homologous alignment scores, candidate biosynthetic gene clusters may be identified in the region surrounding the homologous event (2007). In many cases, the proximal upstream and downstream genes are examined to determine and define a gene cluster. In some of these cases, 5, 6, 7, 8, 9, 10 or more proximal genes in either direction are examined. In addition or alternatively, a gene cluster may be defined by genetic distance from the homologous event. Gene clusters may also be defined using a bioinformatics tool, such as ClustScan, SMURF, CLUSEAN, and/or antiSMASH. Once identified and defined, clusters may be stored as data and/or reported via an output interface (2009).
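
The examination of proximal genes on either side of a homologous event can be sketched as a simple window over an ordered gene list. The function name `flanking_window` and its parameters are hypothetical and are used here only to illustrate the step.

```python
from typing import List, Sequence


def flanking_window(ordered_genes: Sequence[str], hit_index: int, n_flank: int = 5) -> List[str]:
    """Return the hit gene plus up to n_flank genes upstream and downstream.

    ordered_genes is assumed to list the genes of one contig in genomic order.
    """
    if not 0 <= hit_index < len(ordered_genes):
        raise IndexError("hit_index is outside the gene list")
    first = max(0, hit_index - n_flank)
    last = min(len(ordered_genes), hit_index + n_flank + 1)
    return list(ordered_genes[first:last])


contig_genes = [f"g{i}" for i in range(20)]
candidate_region = flanking_window(contig_genes, hit_index=10, n_flank=5)
```

Near a contig end the window is truncated rather than padded, so a hit at index 2 with five flanking genes yields eight genes in this sketch.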

An embodiment for ranking biosynthetic gene clusters using computer systems is provided in FIG. 1E. As shown, Process 3000 can begin with obtaining genetic sequence data of multiple biosynthetic gene clusters. In this process, the clusters obtained each have a homolog protein of interest. The homolog of interest depends on the user's desired result. In many cases, the homolog of interest has a human ortholog that is known to be involved in a human condition, disorder, or disease. In some of these cases, the human ortholog is known to have mutations, either congenital or somatic, that lead to a condition, disorder, or disease. In some other of these cases, the human ortholog is involved in biological pathways that are involved in a condition, disorder, or disease. In other cases, the homolog of interest has an ortholog in an infectious species, including, but not limited to, bacterial, fungal, and protozoan species. In many of these cases, the ortholog in the infectious species is essential to the vitality of the organisms of the species. In some other of these cases, the ortholog in the infectious species is involved in producing a toxin. Furthermore, in many cases, the user has a desired result to prioritize clusters that will produce a secondary metabolite that may target the human or infectious species ortholog.

Identifying a gene cluster that may produce a secondary metabolite that binds a target protein may involve multiple different steps. In some cases, such a gene cluster may be identified by the presence of one or more genes encoding a homolog of the target protein, within or adjacent to the cluster, as determined by a homology search (for example, using the tblastn algorithm, with a maximum score granted when one homolog is found). In some cases, a gene cluster may be identified by the confidence in homology of the target to a gene or genes in a cluster. For example, according to the tblastn algorithm, gene clusters containing a gene with an e-value less than about 10⁻¹⁰, 10⁻²⁰, 10⁻³⁰, 10⁻³⁵ or 10⁻⁴⁰ may be selected. Gene clusters may also be selected using a protein blast method such as blastp to compare a predicted protein sequence against either a known protein sequence or another predicted protein sequence. For blast searches using a known or predicted protein sequence, gene clusters containing a gene with an e-value less than about 10⁻¹⁰, 10⁻²⁰, 10⁻³⁰, 10⁻³⁵ or 10⁻⁴⁰ may be selected. Selected gene clusters may be prioritized with increasing priority scores for lower e-values. In some cases, a gene cluster may be identified by the fraction of the homologous gene that meets a certain threshold of identity (for example, with an increasing score for more identity, and a lower bound threshold of at least about 99%, 95%, 85%, 80%, 75%, 70%, 65%, 60%, 50%, 45%, 40%, 35%, 30%, 25%, 20% or 15% coverage). In some cases, a gene cluster may be selected if it contains a gene which produces a protein with at least about 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, 98%, 99% or 100% homology to the target protein. In some cases, a protein that is encoded by the region that is included in, or positioned proximal to, a BGC may be identical to, or have at least about 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 99% or about 100% homology to a target protein. In some cases, a gene cluster may be identified by the total number of genes homologous to the target protein present in the entire genome of the organism (for example, with a maximum score granted to cases with 2-4 homologs per genome). In some cases, a gene cluster may be identified by the homology of the gene in, or adjacent to, the cluster to the target protein (for example, using the blastx algorithm, with a maximum score granted when the closest homolog, in the target protein's genome, of the gene in the biosynthetic gene cluster is the target protein itself). In some cases, a gene cluster may be identified by the phylogenetic relationship of the target protein to the gene in the cluster (for example, with an increasing score for homologs in the gene cluster that clade with the target protein, with confidence assigned by a bootstrap test or Bayesian inference of phylogeny, and a lower bound threshold defined as homologs in a phylogenetic context that appear in a clade with bootstrap value of 0.7 or Bayesian posterior probability of 0.8). In some cases, a gene cluster may be identified by the expected number of homologs of the target in or adjacent to the biosynthetic cluster (for example, with a greater score the lower the probability of a homolog of the target being present in or adjacent to a biosynthetic cluster of a certain size, given the number of total homologs in the genome, as determined by a permutation test). In some cases, a gene cluster may be identified by the likelihood that the target protein is essential for viability, growth, or other cellular processes in the native environment (for example, through evidence that deletion of homologs in related organisms (such as S. cerevisiae) renders the organism inviable). In some cases, a gene cluster may be identified by one or more of the above methods, or by any two or more of the above methods.
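
The permutation test mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the disclosure: the genome is modeled as a flat list of gene positions, homologs are placed uniformly at random, and the cluster is treated as a fixed window of `cluster_size` genes. All names and defaults are hypothetical.

```python
import random


def chance_of_homolog_in_cluster(
    n_genes: int,
    cluster_size: int,
    n_homologs: int,
    n_perm: int = 5000,
    seed: int = 0,
) -> float:
    """Estimate the probability that at least one homolog falls in the cluster by chance."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the homologs to distinct gene positions.
        positions = rng.sample(range(n_genes), n_homologs)
        # By symmetry, the cluster can be taken as gene indices 0..cluster_size-1.
        if any(position < cluster_size for position in positions):
            hits += 1
    return hits / n_perm


# Example: 3 homologs in a 1000-gene genome, 20-gene cluster window.
p_chance = chance_of_homolog_in_cluster(n_genes=1000, cluster_size=20, n_homologs=3)
```

A low estimated probability indicates that finding a homolog in or next to the cluster is unlikely to be a coincidence, which corresponds to a greater score in the scheme described above. A fuller treatment would permute over real gene coordinates and actual cluster boundaries.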

Utilizing identification steps, Process 3000 also scores each biosynthetic gene cluster based on factors indicative of secondary metabolite synthesis related to the homolog of interest (3003). Accordingly, a score is constructed for each biosynthetic cluster based on one or more of the following:

(a) the presence of one or more homologs of target orthologs within or adjacent to the cluster, as determined by a homology search (e.g., the BLAST algorithm);
(b) the degree of homology of one or more target orthologs to genes in a cluster (e.g., as defined by the e-value);
(c) the fraction of the homologous gene that meets a certain threshold (e.g., the number of homologous protein domains);
(d) the total number of genes homologous to the target ortholog present in the entire genome of the organism;
(e) the degree of homology of a gene in or adjacent to the cluster to the target ortholog;
(f) the number of homologs in the gene family (e.g., the number of homologs of a human target gene in the human genome);
(g) the expected number of homologs of the target ortholog in, or neighboring, a biosynthetic cluster (e.g., the probability of a homolog of the target being present in or adjacent to a biosynthetic cluster of a certain size, given the number of total homologs in the genome, and as determined from a permutation test);
(h) the synteny of the gene cluster with related species (e.g., conservation of gene cluster);
(i) the function class of the target ortholog;
(j) the presence of specific promoters adjacent to the homolog(s) within the gene cluster (e.g., identification of a bidirectional promoter upstream of the homolog and biosynthetic gene);
(k) the presence of specific regulatory elements in the biosynthetic cluster (e.g., the number of transcription factor binding sites shared among target orthologs and homologs/biosynthetic genes in the cluster);
(l) the presence of homologs outside the cluster that are co-regulated with some or all the genes within the biosynthetic cluster; and/or
(m) the presence of protein- and DNA-sequence derived features within the clusters that have successfully been shown to produce secondary metabolites.

It should be understood that a particular user may desire to use one, some, or all the factors listed here, and/or other factors not listed. The factors utilized depend on the user's application and desired result.
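
One way to combine several of the listed factors into a single score is a weighted sum, sketched below. The weighting scheme, the feature names, and the conversion of an e-value to a negative base-10 logarithm are assumptions made for illustration; the disclosure does not prescribe a particular formula.

```python
import math
from typing import Dict, List, Tuple


def evalue_score(evalue: float, floor: float = 1e-180) -> float:
    """Map an e-value to -log10(e-value), so lower e-values give higher scores."""
    return -math.log10(max(evalue, floor))


def composite_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over the factors a user has chosen; missing features count as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())


def rank_clusters(
    cluster_features: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (cluster_id, score) pairs ordered from highest to lowest score."""
    scored = [(cid, composite_score(f, weights)) for cid, f in cluster_features.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights for factors (a), (b) and (c).
weights = {"homolog_present": 2.0, "neg_log10_evalue": 0.1, "coverage": 1.0}
clusters = {
    "cluster_A": {"homolog_present": 1.0, "neg_log10_evalue": evalue_score(1e-30), "coverage": 0.8},
    "cluster_B": {"homolog_present": 1.0, "neg_log10_evalue": evalue_score(1e-10), "coverage": 0.4},
}
ranking = rank_clusters(clusters, weights)
```

Because the weights are supplied by the caller, a user can include only the factors relevant to a given application by leaving the others out of the weight table.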

Process 3000 also has the option to calibrate the score by referencing a set of “true positives” (i.e., cases where there is one or more known targets in or adjacent to a biosynthetic cluster that produces a small molecule targeting the target, such as the lovastatin BGC) (3005). The output of this process may be a ranked and/or scored list of biosynthetic clusters, which may be used to identify the clusters that will produce therapeutic small molecules targeting the products of a disease-related gene (3007). The ranked and/or scored list of clusters can be stored as data or reported via an output interface (3009).
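
A simple form of calibration against true positives is sketched below: the lowest score observed among known true-positive clusters is used as a cutoff, and each ranked cluster is flagged according to whether it meets that cutoff. This particular calibration rule and all names are assumptions for illustration; other approaches, such as fitting factor weights to the true positives, are equally consistent with the description.

```python
from typing import Dict, List, Set, Tuple


def rank_with_calibration(
    scores: Dict[str, float], true_positives: Set[str]
) -> List[Tuple[str, float, bool]]:
    """Rank clusters by score and flag those scoring at least as well as every true positive's minimum."""
    known_scores = [scores[cid] for cid in true_positives if cid in scores]
    # With no true positives available, no cutoff is applied.
    cutoff = min(known_scores) if known_scores else float("-inf")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(cid, score, score >= cutoff) for cid, score in ranked]


# Hypothetical scores; "lovastatin_like" stands in for a known true positive.
example_scores = {"lovastatin_like": 8.0, "cluster_1": 9.5, "cluster_2": 3.0}
calibrated = rank_with_calibration(example_scores, {"lovastatin_like"})
```

The resulting list can be stored or reported as the ranked and/or scored output described above.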

Turning now to FIG. 1F, computer systems (4001) may be implemented on a single or multiple computing devices in accordance with some embodiments of the invention. Computer systems (4001) may be personal computers, laptop computers, and/or any other computing devices with sufficient processing power for the processes described herein. The computer systems (4001) include a processor (4003), which may refer to one or more devices within the computing devices that can be configured to perform computations via machine readable instructions stored within a memory (4007) of the computer systems (4001). The processor may include one or more microprocessors (CPUs), one or more graphics processing units (GPUs), and/or one or more digital signal processors (DSPs). According to other embodiments of the invention, the computer system may be implemented on multiple computers.

In a number of embodiments of the invention, the memory (4007) may contain a gene cluster identification and/or scoring application (4009) that performs all or a portion of various methods according to different embodiments of the invention described throughout the present application. As an example, processor (4003) may perform a gene cluster identification and/or scoring method similar to any of the processes described above with reference to FIGS. 1D and 1E, during which memory (4007) may be used to store various intermediate processing data such as the genetic sequence alignment data (e.g., BLASTn) (4009a), identification of key target sequences with gene clusters (4009b), identification of homolog(s) to a protein of interest (4009c), characterization of homologs (4009d), scores and/or ranks of gene clusters (4009e), and calibration of gene cluster scores (4009f).

In some embodiments of the invention, computer systems (4001) may include an input/output interface (4005) that can be utilized to communicate with a variety of devices, including but not limited to other computing systems, a projector, and/or other display devices. As can be readily appreciated, a variety of software architectures can be utilized to implement a computer system as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.

Although computer systems and processes for chimeric sequence unveiling and performing actions based thereon are described above with respect to FIG. 1F, any of a variety of devices and processes for data associated with cluster identification and/or scoring as appropriate to the requirements of a specific application can be utilized in accordance with many embodiments of the invention.

In some embodiments, gene sequences within a novel cluster, or a set of novel clusters, may be compared to gene sequences from known and characterized biosynthetic gene clusters. In some cases, phylogenetic comparisons may be carried out between gene sequences in a novel cluster and gene sequences from characterized gene clusters. As shown in FIG. 2, of the many biosynthetic enzymes which have been identified from sequence data, only a small fraction have been characterized, suggesting potential for many novel chemistries. Phylogenetic analysis may be performed on the core biosynthetic gene or genes, or on one or more tailoring genes of the novel cluster. Preferred clusters may be clusters containing one or more gene sequences which do not share a close phylogenetic relationship with a sequence from a characterized gene cluster. In some cases, gene clusters may be ordered according to their phylogenetic relationship to characterized gene clusters, and clusters with the most distant relationships may be preferred for further analysis.
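
The ordering by phylogenetic relationship can be sketched as follows, assuming that a pairwise distance between each novel cluster and each characterized cluster has already been computed by an external phylogenetic analysis. The data layout, the function name, and the use of the nearest characterized cluster as the measure of novelty are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def order_by_novelty(distances: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Order novel clusters by distance to their nearest characterized cluster, most distant first.

    distances maps novel cluster id -> {characterized cluster id: phylogenetic distance}.
    """
    nearest = {
        novel: min(to_known.values())
        for novel, to_known in distances.items()
        if to_known  # skip clusters with no computed distances
    }
    return sorted(nearest.items(), key=lambda item: item[1], reverse=True)


# Hypothetical distances for two novel clusters against two characterized ones.
example_distances = {
    "novel_1": {"known_1": 0.2, "known_2": 0.9},
    "novel_2": {"known_1": 1.4, "known_2": 1.1},
}
novelty_order = order_by_novelty(example_distances)
```

In this example, `novel_2` is placed first because even its closest characterized relative is comparatively distant.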

In several embodiments, a specific secondary metabolite is used as a weapon against another organism (i.e., is toxic to or inhibits the growth of a specific type of organism). In many instances, the toxic secondary metabolite may also be toxic to the organism that produces the secondary metabolite. Accordingly, the producing organism may defend against self-harm in a number of ways including (but not limited to) pumping the secondary metabolite out of the cell, enzymatically negating the secondary metabolite, or producing an additional version of the protein targeted by the secondary metabolite that is less sensitive or insensitive to the secondary metabolite (see FIG. 3). In instances in which an organism produces an additional version of the target protein, the “protective” version of the gene that encodes the additional version of the target protein that is less sensitive to the secondary metabolite is often colocalized with the biosynthetic gene cluster, for example the HMGR gene in the lovastatin cluster shown in FIG. 4. Although different from the gene that produces the target protein, the protective version of the gene should produce a protein that maintains detectable homology to the target protein. In several embodiments, the cluster identification process takes advantage of this homology to identify those biosynthetic gene clusters that contain or are adjacent to a gene that encodes a protective homolog of the target protein. In a number of embodiments of the disclosure, the genetic sequences of multiple organisms are analyzed to detect biosynthetic gene clusters possessing this characteristic.

A target protein may be any protein of interest which has a homolog in the genome sequence or sequences from which the gene clusters were obtained. In some cases, the target protein is an enzyme. In some cases, the target protein is a signaling protein. In some cases, the target protein is one which is required by the species of origin. For example, the target protein may be one which contributes to viability or growth of the cell, and deletion or inactivation of the target protein from the cell may have deleterious effects on the viability or growth of the cell. In some cases, the target protein may have a vertebrate or mammalian homolog. In some cases, the target protein has a human homolog. The human homolog may be a protein which is dysregulated in a disease. An example of a gene cluster which was identified using methods disclosed herein, and which comprises a homolog of the BRSK1 gene, is shown in FIG. 15A.

In some embodiments, a biosynthetic cluster of interest which produces a secondary metabolite that interacts with a target protein may also produce a protein for inactivating the secondary metabolite or for secreting the secondary metabolite from the cell in which it is produced. A protein which inactivates the secondary metabolite may be omitted when designing an expression construct to express this cluster in a host cell. A protein involved in secreting the secondary metabolite may be included or omitted when designing an expression construct to express this cluster in a host cell. To identify such clusters, several approaches may be used. For example, a set of biosynthetic gene clusters may be identified from genome data and the identified clusters may be analyzed for the presence of enzymes with activities that may be useful for degradation of a secondary metabolite. In another example, a set of biosynthetic gene clusters may be analyzed for the presence of proteins involved in transport or secretion pathways. In another example, a homology search may be performed across one or more genome sequences to find genes encoding proteins homologous to an enzyme known to degrade a toxic compound. Once such genes have been identified, they may be analyzed for proximity to a biosynthetic gene cluster. Proximity to a gene cluster may be defined as within about 50 kb, 40 kb, 30 kb, 20 kb, 10 kb, 5 kb, 1 kb, or less than 1 kb.
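
The proximity check described above can be sketched with interval arithmetic on genomic coordinates. The function names and the 20 kb default are assumptions for illustration, and the coordinates are assumed to refer to the same contig.

```python
def distance_to_cluster(gene_start: int, gene_end: int, cluster_start: int, cluster_end: int) -> int:
    """Distance in bp between a gene and a cluster; 0 if they overlap or touch."""
    if gene_end < cluster_start:
        return cluster_start - gene_end  # gene lies upstream of the cluster
    if gene_start > cluster_end:
        return gene_start - cluster_end  # gene lies downstream of the cluster
    return 0


def is_proximal(
    gene_start: int,
    gene_end: int,
    cluster_start: int,
    cluster_end: int,
    max_distance_bp: int = 20_000,
) -> bool:
    """True when the gene lies within max_distance_bp of the cluster boundaries."""
    return distance_to_cluster(gene_start, gene_end, cluster_start, cluster_end) <= max_distance_bp
```

For instance, a gene ending at position 95,000 is 5 kb from a cluster starting at position 100,000, so it counts as proximal under a 20 kb threshold but not under a 1 kb threshold.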

In some embodiments, a gene cluster which produces a secondary metabolite that may be toxic to the host cell may also contain signals to direct the production of the secondary metabolite to a specific cellular location. In some cases, the enzymes may be membrane tethered to the intracellular or extracellular side of a cell membrane, or may be secreted through a membrane to the intracellular or extracellular side (including the insides of organelles and vacuoles). For example, the enzymes of the cluster may contain membrane-targeting signals to target the enzymes to either the extracellular membrane or to an intracellular organelle or vacuole. In some cases, the enzymes may be targeted to the extracellular membrane in an orientation which results in the active region of the enzyme being inside of an organelle or vacuole. In some cases, the enzymes may be targeted to the extracellular membrane in an orientation which results in the active region of the enzyme being outside of the cell. Such clusters may be identified by analyzing the predicted proteins of the cluster for the presence of peptide targeting signals or of terminal sequences with homology to targeting signals.

In some cases, a genome or genomes may be searched for gene clusters and the set of identified gene clusters may be searched for genes which are homologous to a target gene, or which produce proteins homologous to a target protein. In other cases, a genome or genomes may be searched for a gene or genes homologous to a target gene, and the identified genes may be analyzed for association with a gene cluster. In other cases, a genome or genomes may be searched for gene clusters and the set of identified gene clusters may be phylogenetically analyzed to determine relationships between the identified gene clusters and known, characterized gene clusters. In yet other cases, a novel genome or genomes may be searched for sequences distantly homologous to a known biosynthetic enzyme, and the identified genes may be analyzed for association with a gene cluster.

Specific secondary metabolites synthesized using methods in accordance with a number of embodiments of the disclosure and the proteins utilized to identify the biosynthetic gene clusters used to synthesize the secondary metabolites and targeted by the secondary metabolites are described herein. An example of a method which may be used to produce a secondary metabolite is shown in FIG. 5. As shown in the exemplary process of FIG. 5A, Process 100 generates and extracts a heterologous compound for use. Exemplary Process 100 can begin by searching genetic data of various organismal species for pathways or BGCs that produce a compound product (101). The organismal species used in this step can be any species. For example, fungal and bacterial species often contain multiple BGC pathways that are encoded in their DNA. Likewise, the genetic data to be searched can be any genetic data available or determinable by the user. Accordingly, on one end of the spectrum, the genetic data may be a fully annotated, publicly available genome of a well-studied species (e.g., Penicillium notatum). On the other end of the spectrum, the genetic data may be a publicly unavailable, partial genomic sequence of a newly discovered species incapable of anthropogenic cultivation, wherein the partial sequence is found to have a gene cluster that may produce a compound.

Exemplary embodiment Process 100 may continue by using the genetic data to reconstruct the compound product pathway in an acceptable genetic expression system (103). Often, to reconstruct the compound pathway, the genetic data is used to create nucleic acid molecules (e.g., DNA) comprising the coding sequences of the pathway genes sufficient to produce the product in the acceptable genetic expression system. The nucleic acid sequences are then transferred into the expression system. An expression system is any organism capable of producing the heterologous compound by heterologous expression of the pathway genes. Typical expression systems include (but are not limited to) E. coli and S. cerevisiae.

Once the compound product pathway is reconstructed and transferred into an expression system, the expression system produces the compound (105) in exemplary Process 100. Typically, production of the compound results from coordinated expression of the pathway genes in the expression system. The coordinated expression of the pathway genes results in the production of the enzymes primarily responsible for constructing the heterologous compound product.

FIG. 5B depicts another exemplary embodiment process. Exemplary Process 200 produces, extracts, and characterizes a biosynthetic compound derived from heterologous expression of a gene cluster. The process may begin with the identification and selection of a gene cluster with an identifiable trait that is indicative of compound production (201). Several processes to select gene clusters exist, including several computer-implemented programs. One such program is antiSMASH 2.0, which is a platform for mining BGCs for production of secondary metabolites that searches for core structures to identify putative BGCs (K. Blin, et al. Nucleic Acids Res. 41:W204-12, 2013, the disclosure of which is incorporated herein by reference in its entirety). Another such method is described herein which utilizes homolog sequences within a proximal region in the genome to identify putative BGCs. It should be noted that many other methodologies could be used to select BGCs.

Once a gene cluster has been selected, Process 200 continues by appropriating nucleic acid molecules with the coding sequences of the various genes within the cluster (203 in FIG. 5B). Typically, the nucleic acid molecules are DNA, but other nucleic acid molecules (e.g., RNA) can be used for certain applications. When extracting gene sequence data from the host organism, it is usual to remove the non-translated portions (e.g., UTRs, introns) from the gene, leaving only the coding sequence, but the non-translated portions may also be used, especially if they provide a beneficial characteristic. Appropriation of the nucleic acid molecules can be performed by many different methods including (but not limited to) direct extraction from the host, chemical synthesis, and/or cDNA generation methods (e.g., reverse transcription of host RNA). Regardless of the method used, the resulting nucleic acid molecules may be available to build into expression vectors for heterologous expression.

Exemplary Process 200 utilizes the appropriated nucleic acid molecules to assemble expression vectors for expression in an appropriate organismal expression system (e.g., E. coli, S. cerevisiae). Expression vectors are nucleic acid molecules that have the necessary components to express a heterologous gene in the expression system. Common expression vectors are plasmid DNA and viral vectors, but also include kits of DNA molecules that can be joined together to form a longer DNA molecule by a recombination methodology (e.g., yeast homologous recombination (YHR)).

To express a heterologous gene from an expression vector, an expression cassette may be used, which comprises the sequences of an appropriate promoter and an appropriate terminator along with the heterologous gene sequence. The promoter is typically located upstream of the heterologous gene and can regulate the gene's expression. Many different types of promoters can be used. The selection of the appropriate promoter depends on the application and expression profile desired. For example, in the S. cerevisiae expression system, production-phase promoters may express heterologous genes only in the production phase of the yeast culture's life cycle, which may have desirable properties. For more description of production-phase promoters, please refer to the related U.S. patent application Ser. No. 15/469,452 (“Inducible Production-Phase Promoters For Coordinated Heterologous Expression in Yeast”), the disclosure of which is incorporated herein by reference in its entirety. However, it should be understood that constitutive promoters and other response-driven promoters could be used within the system.

The sequences of promoters to be used in an expression vector can be derived from various sources. E. coli expression systems often use the T7 promoter derived from the T7 bacteriophage because the promoter reliably produces high expression in E. coli. Endogenous promoter sequences (e.g., the lac operon in E. coli) are expected to perform well within the organismal expression system.

Expression vectors often have other sequences that benefit duplication, selection and stability of the vector within the organismal expression system, in addition to the expression cassette. In several instances, some of these sequences are necessary for maintenance in the expression system. For example, plasmid vectors within an E. coli or S. cerevisiae host require a host origin of replication and a selectable marker. The origin of replication signals the host expression system to replicate the plasmid vector in order to produce more copies of the plasmid as the host cells duplicate and divide. The selectable marker ensures that only the host cells that contain the vector continue to survive and propagate. Accordingly, these sequences may be necessary for viable heterologous expression.

Once the expression vector is assembled, the heterologous BGC genes are expressed using the organismal expression system (207). Accordingly, the expression vector is to exist within the organismal host such that the host will express the heterologous BGC genes to produce the encoded enzymes. The enzymes produce a biosynthetic compound. This compound is then extracted from the expression system (209).

Extracted heterologous biosynthetic compounds can be characterized to determine their various structures and conformations. Some resultant products may have a solitary structure and conformation while other products will have several different structures with multiple conformations. The various structures and conformations can be determined using mass spectrometry, chromatography, and/or other methods.

There are numerous classes of biosynthetic compounds. For example, polyketides and terpenes are classes of compounds derived from various organismal species. Many novel biosynthetic compounds are likely to have beneficial properties, as a multitude of biosynthetic compounds have been found to be useful in several industries.

Illustrated in FIG. 5C is an exemplary pipeline to produce heterologous, biosynthetic compounds. Exemplary Pipeline 300 takes advantage of a yeast expression system to reproduce a fungal BGC in order to produce a heterologous compound product.

Pipeline 300 begins with selection of a biosynthetic gene cluster (301). Depicted, as an example, is a phylogenetic tree of numerous fungal BGCs. Using the phylogenetic data, a BGC having a desired trait is selected. The coding sequences of the various BGC genes are then used to chemically synthesize DNA molecules (303). The synthesized BGC DNA molecules are then assembled into a heterologous expression construct (305). In this example, the DNA molecules are assembled by yeast homologous recombination. Accordingly, the synthesized DNA molecules have overlapping homologous sequences that the yeast will use to recombine the various DNA molecules into a plasmid DNA vector.
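The requirement that adjacent synthesized fragments share overlapping homologous ends can be checked in silico before synthesis. The following Python sketch is illustrative only: the function names are hypothetical, exact (mismatch-free) overlaps are assumed, and the 30 bp default minimum is an assumed placeholder rather than a value taken from this disclosure.

```python
from typing import List


def shared_overlap(left: str, right: str, min_len: int = 30) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    longest_possible = min(len(left), len(right))
    for n in range(longest_possible, min_len - 1, -1):
        if n > 0 and left[-n:] == right[:n]:
            return n
    return 0


def junction_overlaps(fragments: List[str], min_len: int = 30,
                      circular: bool = True) -> List[int]:
    """Overlap length at each junction of an ordered fragment list.

    With circular=True the last fragment is also checked against the first,
    as would be needed to close a plasmid.
    """
    pairs = list(zip(fragments, fragments[1:]))
    if circular and len(fragments) > 1:
        pairs.append((fragments[-1], fragments[0]))
    return [shared_overlap(a, b, min_len) for a, b in pairs]


# Toy fragments with 4-base overlaps, far shorter than real designs would use.
toy = ["AAAACCCCGGGG", "GGGGTTTTACGT", "ACGTCATGAAAA"]
print(junction_overlaps(toy, min_len=4))
```

A junction reported as 0 indicates that the two fragments would need to be redesigned before assembly is attempted.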

Pipeline 300 then utilizes the assembled expression vectors to maintain and express the BGC in the yeast (307). The expression of the various heterologous genes results in expression of a number of heterologous enzymes that then can produce the heterologous biosynthetic compounds. Once a sufficient titer of compound is produced, it can be characterized to determine its structures and properties (309).

Briefly, the method comprises gene cluster selection as discussed herein, synthesis of coding sequences, promoters and terminators, assembly of the cluster coding sequence, expression in a fungal host, and isolation and characterization of compounds produced. An example of a gene cluster identified using the methods herein, and the compound created by expression of the identified genes in yeast, is shown in FIG. 15B.

Cluster Engineering

Once a gene cluster is identified according to the methods described herein, the cluster may be prepared for expression in a heterologous host cell. Editing of a putative gene cluster may involve steps such as: removal of introns, replacement of promoters, replacement of terminators, gene shuffling, and codon optimization. For example, if a gene cluster is to be expressed in S. cerevisiae then coding sequences from the gene cluster may be codon optimized for S. cerevisiae, and operably linked to S. cerevisiae promoters and terminators.
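As one illustration of the intron-removal step, the Python sketch below joins annotated exons into a coding sequence and applies a simple sanity check. It is a hedged example, not the disclosed workflow: the function names, the 0-based half-open exon coordinates, and the use of the standard stop codons as a plausibility test are assumptions made for illustration.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def splice_exons(genomic: str, exons: List[Tuple[int, int]]) -> str:
    """Concatenate exon intervals (0-based, half-open) in genomic order."""
    ordered = sorted(exons)
    for (_, end_a), (start_b, _) in zip(ordered, ordered[1:]):
        if start_b < end_a:
            raise ValueError("exon intervals overlap")
    return "".join(genomic[start:end] for start, end in ordered)


def looks_like_cds(seq: str) -> bool:
    """Heuristic check: ATG start, whole codons, one terminal stop codon."""
    seq = seq.upper()
    if len(seq) % 3 != 0 or not seq.startswith("ATG"):
        return False
    if seq[-3:] not in STOP_CODONS:
        return False
    # No stop codon may appear in frame before the final codon.
    return all(seq[i:i + 3] not in STOP_CODONS for i in range(0, len(seq) - 3, 3))


# Toy gene: two exons separated by a 12-base intron.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
cds = splice_exons(genomic, [(0, 6), (18, 24)])
print(cds, looks_like_cds(cds))
```

A failed check after splicing is one signal that a predicted splice site or start site may be wrong and that the annotation should be re-examined.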

The gene cluster editing may rely on automatic annotation of expressed sequences, introns, and exons, or on manual inspection of the cluster. The gene cluster editing may be done in silico using the sequence data, or may be done in vitro or in vivo using the DNA sequence in a suitable vector such as a cloning vector. In some cases, an initial edited gene cluster may not produce a product, and reanalysis of the predicted coding sequences, and introns of the same, may reveal errors in the predicted transcription start sites, transcription termination sites and/or splice sites.

In an embodiment, this disclosure provides sequences (SEQ ID NOs: 67-483) of cryptic BGCs which encode various products. These BGCs may also be reengineered to provide the coding sequences without the endogenous regulatory sequences. The coding sequences from these clusters may be isolated and cloned into one or more expression vectors for expression in a model host system. The expression vectors may be plasmids, viruses, linear DNA, bacterial artificial chromosomes or yeast artificial chromosomes. The expression vectors may be designed to integrate into the host genome, or to not integrate. In some cases, the expression vector may be a high copy number plasmid.

Promoters

Expression of a refactored gene cluster in a host organism may require coordinated expression of several different coding sequences. In some cases, the expression of multiple different coding sequences in a host organism may require the use of multiple different promoters suitable for that organism. This disclosure provides methods for discovering multiple promoters with similar activities and expression patterns but with differing DNA sequence. Such methods may involve use of closely related species, such as different species of Saccharomyces. Saccharomyces (S.) is a genus of fungi composed of different yeast species. The genus can be divided into two further subgenera: S. sensu stricto and S. sensu lato. The former have relatively similar characteristics, including the ability to interbreed, a uniform karyotype of sixteen chromosomes, and their use in the fermentation industry. The latter are more diverse and heterogeneous. Of particular importance is the S. cerevisiae species within the S. sensu stricto subgenus, which is a popular model organism used for genetic research.

The yeast S. cerevisiae is a powerful host for the heterologous expression of biosynthetic systems, including production of biofuels, commodity chemicals, and small molecule drugs. The yeast's genetic tractability, ease of culture at both small and large scale, and a suite of well-characterized genetic tools make it a desirable system for heterologous expression. Occasionally, production systems require coordinated expression of two or more heterologous genes. Coordinated expression systems in bacteria (e.g., E. coli) have long exploited the operon structure of bacterial gene clusters (e.g., lac operon), allowing a single promoter to control the expression of multiple genes.

The construction of synthetic operons therefore allows a single inducible promoter to control the timing and strength of expression of an entire synthetic system. In yeast, many heterologous-expression systems do not rely on the operon system, but instead rely on a one-promoter, one-gene paradigm. Accordingly, multi-gene heterologous expression is generally performed using multiple expression cassettes with a well-characterized promoter and terminator, each on a single expression vector (e.g., plasmid DNA) (See D. Mumberg, R. Muller, and M. Funk, Gene 156:119-22, 1995). With traditional restriction-ligation cloning, it is also possible to recycle a promoter on a single plasmid by the serial cloning of multiple genes (M. C. Tang, et al., J Am Chem Soc 137:13724-27, 2015).

Turning now to the drawings and data, disclosed embodiments are generally directed to systems and constructs of heterologous expression during the production phase of yeast. In many of these embodiments, the expression system involves coordinated expression of multiple heterologous genes. More embodiments are directed to production-phase promoter systems having promoters that are inducible upon an event in the yeast's growth or by the nutrients and supplements provided to the yeast. Specifically, a number of embodiments are directed to promoters that are capable of being repressed in the presence of glucose and/or dextrose. In more embodiments, the promoters are capable of being induced in the presence of glycerol and/or ethanol. In additional embodiments, at least one production-phase promoter exists within an exogenous DNA vector, such as (but not limited to), for example, a shuttle vector, cloning vector, and/or expression vector. Embodiments are also directed to the use of expression vectors for the expression of heterologous genes in a yeast expression system.

Controlled gene expression is desirable in heterologous expression systems. For example, it would be desirable to express heterologous genes for production during a longer stable phase. Accordingly, decoupling the anaerobic growth and aerobic production phases of a culture allows the yeast to grow to high density prior to introducing the metabolic stress of expressing unnaturally high amounts of heterologous protein. In accordance with many embodiments, the anaerobic growth phase is defined by the yeast culture's energy metabolism in which the yeast cells predominantly catabolize fermentable carbon sources (e.g., glucose and/or dextrose), and a high growth rate (i.e., short doubling-time). In contrast, and in accordance with several embodiments, the aerobic production phase is defined by the yeast culture's energy metabolism in which the yeast cells predominantly catabolize nonfermentable carbon sources (e.g., ethanol and/or glycerol), and a steady growth rate (i.e., long doubling-time). Accordingly, each yeast cell's energy metabolism can be predominantly in the aerobic or anaerobic phase, depending on the local concentration of the carbon source.

FIG. 6A depicts the phases of a yeast culture when provided a fermentable sugar, such as glucose or dextrose sugar, at a concentration of around 2-4% as its main carbon source. Initially, a yeast culture will predominantly catabolize the fermentable sugar, which correlates with an exponential growth with very high doubling rates. The growth phase typically lasts approximately 4-10 hours. During this phase, the catabolism of the fermentable sources results in the production of ethanol and glycerol.

Once glucose becomes scarce, the growth of a yeast culture passes through a diauxic shift and the culture begins to predominantly catabolize nonfermentable carbon sources (e.g., ethanol and/or glycerol) (FIG. 6B). The predominant catabolism of nonfermentable carbon sources correlates with a longer and more stable production phase that can last for several days, or even weeks in an industrial-like setting (FIG. 6A). During the production phase, yeast cultures reach and maintain a high concentration, but have a much lower doubling rate (FIG. 6A). Due to the decrease in doubling rate, yeast cultures no longer expend a great amount of energy and resources on rapid growth and thus can reallocate that energy and those resources to other biological activities, including heterologous expression. Accordingly, it is hypothesized that limiting the transcription of heterologous genes to the production phase would allow a yeast culture to reach a high, healthy confluency that would in turn allow better heterologous protein expression and biosynthetic production.

In yeast, transcriptional regulation can be achieved in several ways, including inducement by chemical substrates (e.g., copper or methionine), the tetON/OFF system, and promoters engineered to bind unnatural hybrid transcription factors. Perhaps the most commonly employed inducible promoters are the promoters controlled by the endogenous GAL4 transcription factor. GAL4 promoters are strongly repressed in glucose, and upon switching to galactose as a carbon source, strong induction of transcription is observed (M. Johnston and R. W. Davis, Mol. Cell Biol. 4:1440-48, 1984). While this system leads to high-level transcription, only four galactose-responsive promoters are known, and galactose is both a more expensive and a less efficient carbon source as compared to glucose (S. Ostergaard, et al., Biotechnol. Bioeng. 68:252-59, 2000). Other carbon-source dependent promoters have also been used for heterologous gene expression. The S. cerevisiae ADH2 gene exhibits significant derepression upon depletion of glucose as well as strong induction by either glycerol or ethanol (K. M. Lee & N. A. DeSilva, Yeast 22:431-40, 2005). Once induced, genes driven by the ADH2 promoter (pADH2) display expression levels equivalent to those driven by highly expressed constitutive counterparts. This induction profile was found to work in heterologous expression studies, as the system auto-induces upon glucose depletion in the late stages of fermentative growth after cells have undergone diauxic shift. The ADH2 promoter has been used extensively for yeast heterologous expression studies, resulting in high-level expression of several heterologous biosynthetic proteins (For example, see C. D. Reeves, et al., Appl. Environ. Microbiol. 74:5121-29, 2008).

As shown in FIG. 6C, the concentration of ethanol and glycerol increases as glucose and dextrose sugar decreases, due to anaerobic glycolysis (i.e., breaking down the fermentable sugar) and subsequent fermentation (i.e., converting the broken-down glucose into alcohol) and glycerol biosynthesis (i.e., converting the broken-down glucose into glycerol). Upon fermentable sugar depletion, yeast cultures undergo a diauxic shift and begin to use ethanol and glycerol as a carbon source instead of glucose. A diauxic shift, as understood in the art, is defined as a point in time when an organism switches from primarily consuming one source for energy to primarily consuming another source. This shift typically elicits significant changes to a yeast culture's gene-expression pattern. Accordingly, it is hypothesized that higher concentrations of ethanol (e.g., ~2-4%) and/or glycerol (e.g., ~2%) could be used to stimulate promoters that either directly or indirectly respond to these concentrations (See FIGS. 6A and 6C).

Various disclosed embodiments are based on the discovery of inducible promoters that can be used for the coordinated expression of multiple genes (e.g., gene cluster pathway) in Saccharomyces yeast. Described below are sets of inducible promoters from S. cerevisiae and related species that are inactive during anaerobic growth, activating transcription only after a diauxic shift when glucose is near-depleted and the yeast cells are respiring (i.e., the production phase). As portrayed in various embodiments, various production-phase promoters are auto-inducing and allow automatic decoupling of the growth and production phases of a culture and thus initiate heterologous expression without the need for exogenous inducers. It should be noted, however, that many embodiments include production-phase promoters that are also inducible in the presence of nonfermentable carbon sources (e.g., ethanol and/or glycerol) supplied to the yeast. As such, multiple embodiments employ recombinant production-phase promoters that act much like constitutive promoters when the host yeast cultures are constantly maintained in ethanol and/or glycerol-containing media.

Once activated, the strength of various production-phase promoters can vary as much as 50-fold. The strongest production-phase promoters stimulate heterologous expression greater than that observed from strong constitutive promoters. The production-phase promoters could be employed in many different applications in which high expression of multiple genes is beneficial. Accordingly, the promoters can be used, for example, in multiple subunit protein production or for the production of biosynthetic compounds that are produced by multiple proteins within a pathway. As discussed in an exemplary embodiment below, some embodiments are used to express multiple proteins involved in production of an indole diterpene compound product. When compared to constitutive promoters, the production-phase promoters produced greater than a 2-fold increase in titer of the exemplary diterpene compounds. In other exemplary embodiments, it was found that the production-phase promoter system outperformed constitutive promoters by over 80-fold. Thus, these promoters can enable heterologous expression of biosynthetic systems in yeast.

The practice of several embodiments will employ, unless otherwise indicated, conventional methods of chemistry, biochemistry, and molecular biology and recombinant DNA techniques within the skill of the art. Such techniques are explained fully in the literature. See, e.g., A. L. Lehninger, Biochemistry (Worth Publishers, Inc., current edition); Sambrook, et al., Molecular Cloning: A Laboratory Manual (3rd Edition, 2001); Methods In Enzymology (S. Colowick and N. Kaplan eds., Academic Press, Inc.).

Inducible Production-Phase Promoters for Heterologous Expression in Yeast

In accordance with several embodiments, inducible production-phase promoters can be constructed into exogenous expression vectors for production of at least one protein in Saccharomyces yeast. In many embodiments, the constructed expression vectors have multiple inducible production-phase promoters in order to express multiple heterologous genes. Several embodiments are directed to production-phase promoters and DNA vectors incorporating these promoters. A promoter, in general, is defined as a noncoding portion of DNA sequence situated proximately upstream of a gene to regulate and promote its expression. Typically, in S. cerevisiae and similar species, the promoter of a gene can be found within 500 bp upstream of a gene's translation start codon. In some cases, a promoter may be about 500 bp, 600 bp, 700 bp, 800 bp, 900 bp, 1 kb, 1.5 kb, 2 kb or more than 2 kb upstream of a gene's transcription start site.
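Retrieving the region immediately upstream of a start codon is straightforward once gene coordinates and strand are known. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, coordinates are 0-based and half-open on the forward strand, and the 500 bp default simply mirrors the typical distance mentioned above.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chromosome: str, cds_start: int, cds_end: int,
                    strand: str, length: int = 500) -> str:
    """Bases immediately upstream of the start codon, read 5' to 3' on the gene's strand.

    cds_start and cds_end bound the coding sequence on the forward strand
    (0-based, half-open). The result is shorter than `length` when the gene
    sits near a chromosome end.
    """
    if strand == "+":
        begin = max(0, cds_start - length)
        return chromosome[begin:cds_start]
    if strand == "-":
        # For a reverse-strand gene the start codon is at the high-coordinate end.
        stop = min(len(chromosome), cds_end + length)
        return reverse_complement(chromosome[cds_end:stop])
    raise ValueError("strand must be '+' or '-'")


# Toy chromosome: 5 flanking bases, a 9-base coding region, 5 flanking bases.
toy_chromosome = "ACGTA" + "ATGAAATAA" + "GATTC"
print(upstream_region(toy_chromosome, 5, 14, "+", length=3))
print(upstream_region(toy_chromosome, 5, 14, "-", length=3))
```

Handling the strand explicitly matters because a reverse-strand gene has its upstream region at higher forward-strand coordinates.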

In accordance with several embodiments, production-phase promoters have two defining characteristics. First, production-phase promoters are capable of repressing heterologous expression of a gene in S. cerevisiae and similar species when the yeast is exhibiting anaerobic energy metabolism. As described previously, yeast exhibit anaerobic metabolism in the presence of a nontrivial concentration of fermentable carbon sources such as, for example, glucose or dextrose. Second, production-phase promoters are capable of inducing heterologous expression of a gene in S. cerevisiae and similar species when the yeast is exhibiting aerobic energy metabolism. As described previously, yeast exhibit aerobic metabolism when fermentable carbon sources are near depleted and the yeast cells switch to a catabolism of nonfermentable carbon sources such as glycerol or ethanol. These characteristics correspond to the phase charts in FIGS. 6A-6C. Tables 1 and 2 provide several examples of production-phase promoters in accordance with several embodiments.

The production-phase promoters can be characterized based on their level of transgene expression relative to each other and to constitutive promoters. As described in an exemplary embodiment below, it was found that the sequences of the endogenous promoters of the S. cerevisiae genes ADH2, PCK1, MLS1, and ICL1 exhibited high-level expression and thus can be characterized as strong production-phase promoters (Table 1). Sequences of the endogenous promoters of the S. cerevisiae genes YLR307C-A, ORF-YGR067C, IDP2, ADY2, GAC1, ECM13, and FAT3 exhibited mid-level expression and thus can be characterized as semi-strong production-phase promoters (Table 1). In addition, sequences of the endogenous promoters of the S. cerevisiae genes PUT1, NQM1, SFC1, JEN1, SIP18, ATO2, YIG1, and FBP1 exhibited low-level expression and thus can be characterized as weak production-phase promoters (Table 1).

TABLE 1
Production-Phase Promoters

Gene Name    Systematic Name    Expression Phenotype    Sequence ID Number
ADH2         YMR303C            Strong                  1
PCK1         YKR097W            Strong                  2
MLS1         YNL117W            Strong                  3
ICL1         YER065C            Strong                  4
YLR307C-A    YLR307C-A          Semi-Strong             5
YGR067C      YGR067C            Semi-Strong             6
IDP2         YLR174W            Semi-Strong             7
ADY2         YCR010C            Semi-Strong             8
GAC1         YOR178C            Semi-Strong             9
ECM13        YBL043W            Semi-Strong             10
FAT3         YKL187C            Semi-Strong             11
PUT1         YLR142W            Weak                    12
NQM1         YGR043C            Weak                    13
SFC1         YJR095W            Weak                    14
JEN1         YKL217W            Weak                    15
SIP18        YMR175W            Weak                    16
ATO2         YNR002C            Weak                    17
YIG1         YPL201C            Weak                    18
FBP1         YLR377C            Weak                    19

The closely related S. sensu stricto species have similar genetics and growth characteristics. Accordingly, the phase charts provided in FIGS. 6A-6C apply generally to S. sensu stricto species. Table 2 provides a list of strong production-phase exogenous promoters of similarly related species in accordance with numerous embodiments of the disclosure.

TABLE 2
Strong Production-Phase Promoters of S. sensu stricto species

Species            Gene Name    Sequence ID Number
S. paradoxus       ADH2         36
S. kudriavzevii    ADH2         37
S. bayanus         ADH2         38
S. paradoxus       PCK1         41
S. kudriavzevii    PCK1         42
S. bayanus         PCK1         43
S. paradoxus       MLS1         44
S. kudriavzevii    MLS1         45
S. bayanus         MLS1         46
S. paradoxus       ICL1         47
S. kudriavzevii    ICL1         48
S. bayanus         ICL1         49

It should be noted that sequences substantially similar to the production-phase promoter sequences are expected to regulate heterologous expression in S. cerevisiae and achieve similar results. Accordingly, a substantially similar sequence of a production-phase promoter, in accordance with numerous embodiments, is any sequence with a high functional equivalence such that, when regulating heterologous expression in S. cerevisiae, it achieves substantially similar results. For example, in an exemplary embodiment below, it was found that the ADH2 promoter of S. bayanus is only 61% homologous, yet achieved strong heterologous expression in S. cerevisiae, similar to the endogenous ADH2 promoter. In some cases, a substantially similar sequence may be homologous to the promoter sequences identified herein (e.g., have a nucleotide BLAST e value of less than or equal to 10⁻¹⁰, 10⁻²⁰, 10⁻³⁰, 10⁻³⁵ or 10⁻⁴⁰).
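A percentage such as the 61% figure above is commonly computed as the fraction of identical positions in a pairwise alignment. The text does not state which convention was used, so the Python sketch below shows only one common choice, assumed here for illustration: identity over aligned columns in which neither sequence has a gap. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over gap-free columns of a pairwise alignment.

    Both inputs must already be aligned (equal length, gaps marked with `gap`).
    Returns 0.0 when there are no gap-free columns to compare.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == gap or base_b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment: 8 gap-free columns, 6 of which match.
print(percent_identity("ACGT-ACGTA", "ACGATAC-TT"))
```

Other conventions divide by the full alignment length or by the shorter sequence length, so reported percentages are only comparable when the same denominator is used.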

FIG. 7A provides an exemplary schematic of a section of an exogenous DNA vector (e.g., cloning vector, expression vector, and/or shuttle vector) having a production-phase promoter sequence embedded within. A vector is capable of transferring nucleic acid sequences to target cells (e.g., yeast). Typical DNA vectors include, but are not limited to, plasmid or viral constructs. DNA vectors are also meant to include a kit of various linear DNA fragments that are to be recombined to form a plasmid or other functional construct, as is common in yeast homologous recombination methods (See e.g., Z. Shao, H. Zhao & H. Zhao, Nucleic Acids Research 37:e16, 2009, the disclosure of which is incorporated herein by reference). Often, embodiments of cloning vectors will incorporate other sequences in addition to the production-phase promoter. As depicted in FIG. 7A, the exemplary cloning vector has a terminator sequence and cloning/recombination sequence in addition to the production-phase promoter, each of which can assist with expression vector construction. Furthermore, other sequences necessary for growth and amplification can be incorporated into the promoter vector. Embodiments of these sequences may include, for example, at least one appropriate origin of replication, at least one selectable marker, and/or at least one auxotrophic marker. It should be noted, however, that various embodiments of the disclosure are not required to contain a cloning sequence, a terminator sequence, or either of them. For example, embodiments of a typical shuttle vector may only contain the production-phase promoter sequence along with the necessary sequences for amplification in a biological system.

For purposes of this application, an exogenous DNA vector is any DNA vector that was constructed, at least in part, exogenously. Accordingly, DNA vectors that are assembled using the yeast's own cell machinery (e.g., yeast homologous recombination) would still be considered exogenous if any of the DNA molecules transduced within yeast for recombination contain exogenous sequence or were produced by a non-host methodology, such as, for example, chemical synthesis, PCR amplification, or bacterial amplification.

As shown in FIG. 7B, various embodiments of the disclosure are directed to DNA vectors having multiple production-phase promoters. In these various embodiments, multiple different production-phase promoters are incorporated, preferably each having a unique sequence and derived from a different gene and/or S. sensu stricto species. Having unique promoter sequences can prevent complications that can arise during product production in yeast, such as, for example, unwanted DNA recombination at sites similar to the promoter sequences that renders the DNA vector constructs undesirable. In many embodiments, the DNA vector has at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more than 20 production-phase promoters. As the size of the DNA vector increases, the utility may decrease, as larger vectors may become unwieldy for the intended organism to handle. For example, plasmids for amplification in E. coli are often somewhere between 2,000 and 10,000 base pairs (bp) but can handle up to 20,000 bp or so. Likewise, plasmids for amplification and growth in yeast can vary from approximately 10,000 to 30,000 bp. Viral vectors, on the other hand, often have a limited construct size and thus may require a more precise vector size. Thus, depending on vector and intended use, the number of production-phase promoters within a DNA vector will vary.
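The preference for giving each coding sequence a promoter of unique sequence can be expressed as a simple assignment over the strong promoters listed in Tables 1 and 2. The Python sketch below is illustrative only: the sequence ID numbers are taken from those tables, while the function names, the ordering of the pool, and the example gene labels are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Sequence ID numbers of strong production-phase promoters (Tables 1 and 2).
PROMOTER_SEQ_IDS: Dict[str, Dict[str, int]] = {
    "ADH2": {"S. cerevisiae": 1, "S. paradoxus": 36, "S. kudriavzevii": 37, "S. bayanus": 38},
    "PCK1": {"S. cerevisiae": 2, "S. paradoxus": 41, "S. kudriavzevii": 42, "S. bayanus": 43},
    "MLS1": {"S. cerevisiae": 3, "S. paradoxus": 44, "S. kudriavzevii": 45, "S. bayanus": 46},
    "ICL1": {"S. cerevisiae": 4, "S. paradoxus": 47, "S. kudriavzevii": 48, "S. bayanus": 49},
}
SPECIES_ORDER = ["S. cerevisiae", "S. paradoxus", "S. kudriavzevii", "S. bayanus"]

Promoter = Tuple[str, str, int]  # (source gene, source species, sequence ID number)


def promoter_pool() -> List[Promoter]:
    """All strong promoters, ordered so that different source genes are used first."""
    return [(gene, species, PROMOTER_SEQ_IDS[gene][species])
            for species in SPECIES_ORDER
            for gene in PROMOTER_SEQ_IDS]


def assign_unique_promoters(coding_sequences: List[str]) -> Dict[str, Promoter]:
    """Give each coding sequence its own promoter, never reusing a sequence."""
    if len(set(coding_sequences)) != len(coding_sequences):
        raise ValueError("coding sequence names must be unique")
    pool = promoter_pool()
    if len(coding_sequences) > len(pool):
        raise ValueError("more coding sequences than distinct promoters")
    return dict(zip(coding_sequences, pool))


# Hypothetical six-gene cluster.
plan = assign_unique_promoters(["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"])
for name, promoter in plan.items():
    print(name, promoter)
```

Ordering the pool by source gene first means that promoters from different genes are exhausted before an ortholog from another species is reused, which keeps sequence similarity between promoters on the same vector low.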

Although FIG. 7B depicts recombination sites, cloning sites, and terminator sequences, it should be noted that these sequences may or may not be included in various embodiments of DNA vectors having multiple production-phase promoters. The incorporation of these sequences or other various sequences is often dependent on the purpose of the DNA vector. For example, cloning vectors may not include a terminator sequence if that sequence is to be incorporated into an expression construct at another stage of assembly.

FIG. 8A depicts an exemplary heterologous expression vector having a production-phase promoter for expression in yeast, in accordance with various embodiments of the disclosure. Expression constructs contain an expression cassette that has a promoter, a heterologous gene, and a terminator sequence in order to produce an RNA molecule in an appropriate host. Expression cassettes in accordance with numerous embodiments will have a production-phase promoter situated proximately upstream of a heterologous gene of which the promoter is to regulate expression. It should be understood that the precise location of the production-phase promoter upstream of the heterologous gene may vary, but the promoter generally is within a certain proximity to adequately function.

In many embodiments of the disclosure, a heterologous gene is any gene driven by a production-phase promoter, wherein the heterologous gene is different from the endogenous gene that the promoter regulates within its endogenous genome. Accordingly, a S. cerevisiae production-phase promoter could regulate another S. cerevisiae gene provided that the gene to be regulated is not the gene endogenously regulated. For example, the S. cerevisiae ADH2 promoter should not regulate the S. cerevisiae ADH2 gene; however, the S. cerevisiae ADH2 promoter can regulate any other S. cerevisiae gene or the ADH2 gene from any other species. Often, in accordance with many embodiments, the heterologous gene is from a different species than the species from which the production-phase promoter sequence was obtained.
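The definition above reduces to a simple comparison, which may be sketched as follows. This is a minimal Python illustration only; the `GeneRef` type, the `is_heterologous` name, and the use of S. paradoxus and PCK1 as example inputs are assumptions introduced here and are not part of the disclosure.

```python
from typing import NamedTuple


class GeneRef(NamedTuple):
    """Hypothetical identifier for a gene: source species plus gene name."""
    species: str
    gene: str


def is_heterologous(promoter_source: GeneRef, regulated: GeneRef) -> bool:
    """True when the regulated gene is not the gene the promoter natively drives.

    A gene counts as heterologous if it differs from the promoter's endogenous
    gene in species, in gene name, or in both.
    """
    def normalize(ref):
        return (ref.species.strip().lower(), ref.gene.strip().upper())

    return normalize(promoter_source) != normalize(regulated)


adh2_promoter = GeneRef("S. cerevisiae", "ADH2")
print(is_heterologous(adh2_promoter, GeneRef("S. cerevisiae", "ADH2")))
print(is_heterologous(adh2_promoter, GeneRef("S. cerevisiae", "PCK1")))
print(is_heterologous(adh2_promoter, GeneRef("S. paradoxus", "ADH2")))
```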

Although not depicted, various embodiments of expression cassettes may include other sequences, such as, for example, intron sequences, Kozak-like sequences, and/or protein tag sequences (e.g., 6×-His) that may or may not improve expression, production, and/or purification. In yeast, various embodiments of expression vectors will also minimally have a yeast origin of replication (e.g., 2-micron) and an auxotrophic marker (e.g., URA3) in addition to the expression cassette. Other nonessential sequences may also be included, such as, for example, bacterial origins of replication and/or bacterial selection markers that would render the expression vector capable of amplification in a bacterial host in addition to a yeast host. Accordingly, various embodiments of expression vectors would include the essential sequences for heterologous expression in yeast and other various embodiments would include additional nonessential sequences.

In accordance with various embodiments, a DNA vector having a production-phase promoter expression cassette can be transformed into a yeast cell. Alternatively, and in accordance with numerous embodiments, a DNA vector having a production-phase promoter expression cassette can be assembled within yeast using homologous recombination techniques. Once existing within a yeast cell, the production-phase promoter can regulate the expression of a heterologous gene in accordance with the yeast cell's energy metabolism. As described previously, and in accordance with many embodiments, production-phase promoters repress heterologous expression when the yeast cell is in an anaerobic energy metabolic state. Alternatively, and in accordance with a number of embodiments, production-phase promoters induce heterologous expression when the yeast cell is in an aerobic energy metabolic state.

Depicted in FIG. 8B are alternative exemplary heterologous expression vectors having multiple production-phase promoters for expression of multiple genes in yeast in accordance with numerous embodiments. In some embodiments, the expression vectors will include at least two expression cassettes, each with a unique promoter, gene, and terminator sequence in order to prevent unwanted recombination. The number of expression cassettes will vary based on vector construct design and application. For heterologous expression in S. cerevisiae, it has been found that plasmid expression vectors of approximately 30,000 bp are still tolerated. Thus, vectors containing up to seven production-phase promoter expression cassettes can be incorporated into an expression vector and have been found to be able to maintain adequate gene expression and protein production. Larger vectors with more expression cassettes may be tolerated.

Although FIG. 8B depicts multiple expression cassettes sequentially in the same orientation (5′ to 3′), it should be understood that the combination of two or more expression cassettes is not limited to sequential linear organization in the same orientation. Expression cassettes in accordance with many embodiments exist within the expression vector in any orientation and in any sequential order. Furthermore, it should be understood that other sequence elements of an expression vector (e.g., an auxotrophic marker) may be among and/or between the multiple expression cassettes. Optimal vector design is likely to depend on various factors, such as, for example, optimizing the location of the auxotrophic marker to enable the final expression vector to include each expression cassette to be incorporated.

DNA heterologous expression vectors are a class of DNA vectors, and thus the description of general DNA vectors above also applies to the expression vectors. Accordingly, many embodiments of the expression vectors are formulated into a plasmid vector, a viral vector, a circular vector, or a kit of linear DNA fragments to be recombined into a plasmid by yeast homologous recombination. In several of these embodiments, the end-product vector contains at least one expression cassette having a production-phase promoter. It should be understood that, in addition to the at least one production-phase promoter, some vector embodiments incorporate expression cassettes that include other promoters, such as (but not limited to) constitutive promoters that maintain high expression during the growth and production phases.

The various embodiments of heterologous expression vectors having at least one production-phase promoter can be used in numerous applications. For example, high expression in the production phase can lead to better, prolonged expression, as compared to constitutive promoters. In many applications, the end product is a protein from a single gene or a protein complex of multiple genes to be purified from the culture. For these applications, high, prolonged expression using production-phase promoters can lead to better yields of proteins. Furthermore, when the heterologous protein is toxic to the host yeast cells, the use of production-phase promoters prevents the expression of the toxic protein during growth phase, allowing the yeast to reach a healthy confluency before mass protein production.

The production-phase promoter vectors can also benefit the production of a biosynthetic compound from a gene cluster. Many products derived from various natural species are produced from a cluster of genes with sequential enzymatic activity. For example, the antibiotic emindole SB is produced from a cluster of four genes that is expressed in Aspergillus tubingensis. To reproduce this gene cluster in a yeast production model, a production-phase promoter vector system with four different expression cassettes could work. This system would allow the yeast to reach a healthy confluency before the energy-draining expression of four heterologous proteins begins, leading to better overall yields of the antibiotic product. In fact, experimental results provided in an exemplary embodiment described in Example 1 below demonstrate that a production-phase promoter vector outperformed a constitutive promoter vector approximately 2-fold to produce the emindole SB product.

FIG. 9 depicts an exemplary process (Process 400) to implement various embodiments of production-phase promoters. To begin, Process 400 identifies and selects at least one gene for heterologous expression in yeast (401). The choice of gene(s) for expression would depend on the desired outcome. For example, to produce a biosynthetic compound, one would likely select to express all, or a subset, of the genes within a biosynthetic gene cluster of a particular organism. Once the gene(s) have been selected, Process 400 then appropriates DNA molecules having the coding sequence of the selected genes (403). As is well known in the art, there are many ways to appropriate DNA molecules, which include chemical synthesis, extraction directly from the biological source, or amplification of a gene by polymerase chain reaction (PCR).

Process 400 then assembles the appropriated DNA molecules into an expression vector having production-phase promoters (405). There are many ways to assemble DNA expression vectors that are well known in the art, which include popular methodologies such as homologous recombination and restriction digestion with subsequent ligation. After assembly, the resultant expression vectors can be expressed in Saccharomyces yeast to obtain the desired outcome (407).

Yeast Homologous Recombination for Plasmid Construction and Direct Plasmid Sequencing

Each of the one or more expression vectors may contain one or more promoters suitable for expression of a heterologous gene in a model host system. Each expression vector may contain a single coding sequence or multiple coding sequences. Multiple coding sequences may be functionally linked to a single promoter, for example via an internal ribosome entry site, or may be linked to multiple promoters. The expression vectors may also contain additional elements to regulate or increase the transcriptional activity, for example enhancers, polyA sequences, introns, and posttranscriptional stability elements. The expression vectors may also contain one or more selectable markers.

To improve high-throughput assembly and characterization of orphan biosynthetic systems or other systems, an automated DNA assembly pipeline using yeast homologous recombination (YHR) as its core technology was developed. An example of the design strategy for assembling DNA parts for this pipeline is illustrated in FIG. 10A. In one embodiment, a method is provided for assembling synthetic gene clusters with heterologous regulation. DNA polynucleotides coding for a series of distinct promoters and terminators may be obtained in bulk and used for a variety of different synthetic gene clusters. Once a gene cluster of interest is identified, the coding sequences are determined and then all coding sequences are synthesized with a flanking sequence (assembly overhang) on each side. The assembly overhangs encode either the flanking promoter and terminator, if the gene is small enough to be ordered as a single piece, or the adjacent gene fragments, for longer sequences. The length of the flanking sequences may vary. In some cases, the flanking sequences may be about 30 bp, 40 bp, 50 bp, 60 bp, 70 bp, 80 bp, 90 bp, 100 bp, 30-100 bp, 30-70 bp, 40-60 bp, 40-80 bp, or 45-55 bp in length. Placing assembly overhangs exclusively on the unique coding sequence fragments allows for all regulatory cassettes to be generated in bulk and stockpiled, as the same fragments are used in all assemblies. For example, in an assembly involving three or more genes, an auxotrophic marker may be placed between the second terminator and third terminator while no marker is present on the vector. By providing the auxotrophic marker and origin of replication on separate fragments, reaction background was significantly reduced. Additional modest increases in efficiency were observed when the assembly host is lacking a DNA ligase, such as the DNL4 DNA ligase.

In some embodiments, this disclosure provides a system for generating a synthetic gene cluster via homologous recombination. The system comprises 1 through N unique promoter sequences, 1 through N unique terminator sequences, and 1 through N unique coding sequences. Each terminator sequence may be linked to the following promoter sequence, for example terminator 1 is linked to promoter 2, terminator 2 is linked to promoter 3, and so forth until terminator N−1, which is linked to promoter N. In some cases, promoter 1 and terminator N may be attached to a linear plasmid backbone. Coding sequence 1 is attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical or homologous to the last 30-70 base pairs of promoter 1 and a second end portion is identical or homologous to the first 30-70 base pairs of terminator 1. Coding sequence 2 is attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical or homologous to the last 30-70 base pairs of promoter 2 and a second end portion is identical or homologous to the first 30-70 base pairs of terminator 2. Coding sequence N is also attached to an additional 30-70 base pair sequence on each end such that a first end portion is identical or homologous to the last 30-70 base pairs of promoter N and a second end portion is identical or homologous to the first 30-70 base pairs of terminator N. These DNA fragments may be assembled by transforming the 1 through N promoters, terminators and coding sequences into a yeast cell, where they are combined through yeast homologous recombination, and then isolating a plasmid containing the 1 through N promoters, terminators and coding sequences from the yeast cell.
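The flanking-sequence design described above may be sketched in code as follows. This Python example is illustrative only: the function name `add_assembly_overhangs`, the default arm length of 50 bp, and the use of exact (rather than merely homologous) copies of the promoter and terminator ends are assumptions made for the sketch. It covers only the case in which each coding sequence is ordered as a single piece.

```python
def add_assembly_overhangs(promoters, terminators, coding_seqs, arm_len=50):
    """Flank coding sequence i with the end of promoter i and start of terminator i.

    Each returned fragment is:
        last `arm_len` bp of promoter i + coding sequence i
        + first `arm_len` bp of terminator i
    so that yeast homologous recombination can join it to the stockpiled
    promoter and terminator fragments.
    """
    if not (len(promoters) == len(terminators) == len(coding_seqs)):
        raise ValueError("need one promoter and one terminator per coding sequence")
    if not 30 <= arm_len <= 70:
        raise ValueError("arm_len is expected to fall in the 30-70 bp range")

    fragments = []
    for promoter, terminator, cds in zip(promoters, terminators, coding_seqs):
        if len(promoter) < arm_len or len(terminator) < arm_len:
            raise ValueError("promoter or terminator shorter than the arm length")
        upstream_arm = promoter[-arm_len:]      # homology to the promoter end
        downstream_arm = terminator[:arm_len]   # homology to the terminator start
        fragments.append(upstream_arm + cds + downstream_arm)
    return fragments
```

Because the overhangs are added only to the coding sequence fragments, the promoter and terminator inputs remain unchanged and can be reused across assemblies.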

An example of this system is shown in FIG. 10A. In this example N equals four. The system comprises four unique promoters (110, 120, 130 and 140), four unique terminator sequences (210, 220, 230 and 240), and four unique coding sequences (310, 320, 330, and 340). Each of the coding sequences is created with an additional 30-70 base pair sequence that is homologous or identical to the sequence of the preceding promoter and an additional 30-70 base pair sequence that is homologous or identical to the sequence of the subsequent terminator. Thus, coding sequence 310 is flanked by sequence 111, which is identical or homologous to at least a part of sequence 110, and sequence 211, which is identical or homologous to at least a part of sequence 210. Coding sequence 320 is flanked by sequence 121, which is identical or homologous to at least a part of sequence 120, and sequence 221, which is identical or homologous to at least a part of sequence 220. Coding sequence 330 is flanked by sequence 131, which is identical or homologous to at least a part of sequence 130, and sequence 231, which is identical or homologous to at least a part of sequence 230. Coding sequence 340 is flanked by sequence 141, which is identical or homologous to at least a part of sequence 140, and sequence 241, which is identical or homologous to at least a part of sequence 240. In this example promoter sequence 110 and terminator sequence 240 are attached to the ends of a linearized plasmid backbone, and the DNA fragment comprising terminator 210 and promoter 120 further comprises an auxotrophic marker (400). Terminator 210 is linked to promoter 120, terminator 220 is linked to promoter 130, and terminator 230 is linked to promoter 140.

As shown in FIG. 10B, once the DNA sequences from FIG. 10A are transfected into a yeast cell, the homologous sequences are paired up and the fragments are linked together through yeast homologous recombination. The resultant DNA plasmid is illustrated in FIG. 10C.

Traditionally, for yeast homologous recombination plasmid assemblies, plasmid DNA is isolated from assembly clones and transformed into E. coli in order to obtain sufficiently pure DNA to enable sequencing. The necessity of this step arises from the relatively low plasmid yields from yeast and the large amounts of contaminating genomic DNA in every sample. This disclosure provides a method by which plasmid DNA may be sequenced directly out of yeast. This may be achieved by a modified plasmid prep in which the majority of contaminating DNA is removed by treatment with an exonuclease enzyme. Any enzyme with exonuclease activity and free of endonuclease activity may be used in this step. Examples of exonuclease enzymes include but are not limited to: Lambda Exonuclease, RecJf, Exonuclease III (E. coli), Exonuclease I (E. coli), Exonuclease T, Exonuclease V (RecBCD), Exonuclease VIII (truncated), Exonuclease VII, T5 Exonuclease, and T7 Exonuclease. In some cases, the exonuclease is Exonuclease V. If an exonuclease with activity for single stranded DNA (ssDNA) and not double stranded DNA (dsDNA) is to be used, then the DNA may first be heated to denature the dsDNA. In some cases, the DNA may be treated with a topoisomerase to relax supercoiled plasmids. In some cases, the DNA may not be treated with a topoisomerase. Once the plasmid DNA has been purified by this method, sequencing libraries can be prepared. FIG. 10D demonstrates the increase in purity observed with the exonuclease treatment. Overall, this pipeline has been applied to the sequencing of >1000 clones. Assemblies of up to 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 or more than 30 unique DNA fragments can be achieved with high efficiency. FIG. 10E shows efficient assembly of 2, 3, 4, 5, 6, 8, 10, 12, and 14 DNA fragments. In some cases a strain as described herein allows DNA assembly via homologous recombination with an efficiency of at least 60%, 70%, 80%, 90%, 100%, 110%, 120%, 130%, 140%, 150%, or more than 150% as compared to DNA assembly in BY.

To increase the efficiency of assemblies by yeast homologous repair, the relative efficiencies of BY4743 and BY4743ΔDNL4, a strain in which the DNL4 ligase involved in non-homologous end joining has been deleted, were tested. FIG. 11A illustrates efficiencies of several plasmid assemblies done in both strains, demonstrating that the strain carrying the deletion of the DNL4 DNA ligase consistently serves as the more efficient assembly background.

The sequencing of plasmid DNA directly out of yeast as in FIG. 11E (e.g., without transforming into another host such as E. coli as in FIG. 11D) is an advantage of the methods described herein. In establishing these methods, multiple means of preparation for both the plasmid DNA and the next-generation sequencing (NGS) library prep were tested. Shown in FIG. 11B is a comparison of sequencing efficiency using DNA prepared both from colonies picked from plates and cell pellets collected from 1 ml of liquid culture. These data show that these approaches generate samples of equivalent purity.

Initially, the platform utilized an NGS library preparation in which the purified, exonuclease treated plasmid DNA was ultrasonically sheared, followed by end repair, A-tailing, and adaptor ligation. In order to decrease labor and increase throughput, a recently published modification of the Illumina Nextera XT transposase based prep was performed (M. Baym, et al., PLoS One 10:e0128036, 2015). Ultrasonic shearing necessitated the serial processing of multiple plates of clones, while tagmentation allows for parallel processing of multiple plates. FIG. 11C demonstrates that this modified Nextera preparation provides equivalent efficiency as compared to the standard approach.

This approach may be suitable for multiple DNA preparation methods in multiple strain backgrounds. Additionally, it was shown that this approach is compatible with various library preparations for sequencing on Illumina platforms. It is anticipated that this approach could be easily modified to function in sequencing workflows using alternate sequencing platforms such as those provided by Pacific Biosciences and Oxford Nanopore Technologies.

Host Cells

The expression vectors may be transfected into a host cell to produce secondary metabolites. The host cell may be any cell capable of expressing the coding sequences from the expression vectors. The host cell may be a cell which can be grown and maintained at a high density. For example, the host cell may be one which may be grown and maintained in a bioreactor or fermenter. The host cell may be a fungal cell, a yeast cell, a plant cell, an insect cell or a mammalian cell.

In some cases the host cell is bacterial. The bacteria may be Proteobacteria such as Caulobacteria, phototrophic bacteria, cold adapted bacteria, Pseudomonads, or halophilic bacteria; Actinobacteria such as Streptomycetes, Nocardia, Mycobacteria, or Coryneforms; or Firmicutes such as Bacilli or lactic acid bacteria. Examples of bacteria which may be used include, but are not limited to: Caulobacter crescentus, Rhodobacter sphaeroides, Pseudoalteromonas haloplanktis, Shewanella sp. strain Ac10, Pseudomonas fluorescens, Pseudomonas putida, Pseudomonas aeruginosa, Halomonas elongata, Chromohalobacter salexigens, Streptomyces lividans, Streptomyces griseus, Nocardia lactamdurans, Mycobacterium smegmatis, Corynebacterium glutamicum, Corynebacterium ammoniagenes, Brevibacterium lactofermentum, Bacillus subtilis, Bacillus brevis, Bacillus megaterium, Bacillus licheniformis, Bacillus amyloliquefaciens, Lactococcus lactis, Lactobacillus plantarum, Lactobacillus casei, Lactobacillus reuteri, and Lactobacillus gasseri.

In some cases the host cell is a fungal cell. In some cases, the host cell is a yeast cell. Examples of yeast cells include, but are not limited to, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Candida albicans, and Cryptococcus neoformans. In some cases the host cell may be a filamentous fungus, such as a mold. Examples of molds include, but are not limited to, Acremonium, Alternaria, Aspergillus, Cladosporium, Fusarium, Mucor, Penicillium, and Rhizopus. In some cases, the host cell may be an Acremonium cell. In some cases, the host cell may be an Alternaria cell. In some cases, the host cell may be an Aspergillus cell. In some cases, the host cell may be a Cladosporium cell. In some cases, the host cell may be a Fusarium cell. In some cases, the host cell may be a Mucor cell. In some cases, the host cell may be a Penicillium cell. In some cases, the host cell may be a Rhizopus cell.

In some cases, the host cell is an insect cell. In some cases the host cell is a mammalian cell. Examples of mammalian cell lines include HeLa cells, HEK293 cells, B16 melanoma cells, Chinese hamster ovary cells, or HT1080. In some cases, the host cell is a plant cell. In some cases, the host cell may be part of a multicellular host organism.

In some cases, the host cell is a genetically engineered cell. The yeast strain BJ5464 has historically been a workhorse strain for expression of heterologous proteins. BJ5464 lacks two vacuolar protease genes (PEP4 and PRB1), which makes the strain useful for biochemical studies, owing to reduced protein degradation. However, BJ5464 has several problems that limit its utility. It has a high rate of petite cell formation, which results in offspring that cannot respire (grow on ethanol as a carbon source) and cannot express the respiration-induced promoters used in this project. It is not genetically tractable because it cannot sporulate, and its non-deletion auxotrophic markers prevent facile genome editing. Finally, BJ5464 is slow growing.

This disclosure includes a new yeast super host based on the BY background. BY is a direct descendant of the yeast genome sequence reference strain and contains the complete deletions of auxotrophic markers that facilitate genome editing. It is the basis of the barcoded deletion collection, which has led to a wealth of genetic and chemico-genomic data. However, it also has major problems limiting its utility. In particular, it has the poorest sporulation frequency and highest petite frequency of all common lab strains.

The petite phenotype arises due to a defect in aerobic respiration. Petite yeasts are unable to grow on non-fermentable carbon sources (for example glycerol or ethanol), and form small anaerobic-sized colonies when grown in the presence of fermentable carbon sources (for example glucose). The phenotype results from mutations in the mitochondrial genome, loss of mitochondria, or mutations in the host cell genome.

The genes and single nucleotide polymorphisms (SNPs) responsible for the sporulation and respiration defects have been identified (FIG. 12). The sporulation defect may be repaired by a series of genetic crosses to a previously repaired strain. The respiration problem caused by mitochondrial genome instability may also be corrected using genome editing. Genome editing may be performed with any method known in the art, such as the 50:50 method (J. Horecka and R. W. Davis, Yeast 31:103-12, 2014). In some cases, an improved version of Mega 50:50 may be used in which a double stranded break is introduced into the genomic locus to be modified, increasing efficiency by several orders of magnitude (J. D. Smith, et al., Mol. Syst. Biol. 13:913, 2017).

In some cases, the host cell may be a cell which has been engineered to repair a sporulation defect. For example, the host cell may be a fungal cell with a repaired sporulation defect. In some cases, the host cell is a yeast cell with a repaired sporulation defect. In some cases, the host cell is a BY yeast cell in which the sporulation defect has been repaired, as in FIG. 12.

In some cases, the host cell may be a cell which has been engineered to repair a respiratory defect or a mitochondrial genome instability defect. For example, the host cell may be a fungal cell with a repaired mitochondrial stability defect. In some cases, the host cell is a yeast cell with a repaired mitochondrial stability defect. In some cases, the host cell is a BY yeast cell in which the mitochondrial genome instability defect has been repaired, as in FIG. 12. In some cases, the host cell may be a cell in which both a sporulation defect and a mitochondrial genomic instability defect have been repaired. Using the genetic crosses and genome engineering methods discussed above and the genomic repairs outlined in FIG. 12, the BY strain was engineered to repair the mitochondrial genome instability. An unexpected benefit of repairing the mitochondrial genome instability defect was that the strains grew faster on non-fermentable carbon sources, such as ethanol (see FIGS. 13 and 14A). This is commonly the growth condition of choice for expression of heterologous genes, which are often linked to production-phase promoters activated by growth on non-fermentable carbon sources.

In some cases, the host cell may be genetically engineered to lack a gene involved in non-homologous end joining. The lack of such a gene may increase the efficacy of homologous recombination in such an engineered cell. The appropriate genes to delete may vary in each host. As an example, an engineered yeast host cell may lack a ligase such as the DNL4 DNA ligase. An engineered bacterial host cell may lack one or both of a Ku homodimer and the multifunctional ligase/polymerase/nuclease LigD. Other genes which may be involved in non-homologous double stranded break repair, depending on species, include: Mre11, Rad50, Xrs2, Nbs1, DNA-PKcs, Ku70, Ku80, DNA ligase IV, XLF (also known as Cernunnos), Artemis, XRCC4, Dnl4, Lif1, Nej1 and Sir2.

DHY strains have utility for expression of heterologous genes for heterologous compound production, as well as the ability to perform homologous recombination and DNA assembly. This combination of abilities in one strain allows DNA assembly and production of heterologous compounds in the same strain, whereas previously these two steps were performed in separate yeast strains (BY for DNA assembly and BJ5464 for expression and small molecule production).

These improvements can provide the DHY strain collection with a number of advantages over previous strains, particularly the BY strains: DHY can be faster-growing, result in fewer petite colonies (respiration-deficient), be genetically tractable, allow better expression from ADH2-like promoters, and allow both DNA assembly and production of heterologous products in the same strain (FIGS. 14B and 14C).

In some embodiments, a genetically engineered host cell lacks one or more conditionally essential genes which can be provided by a plasmid or other DNA vector. This allows for the selection of cells which are expressing the DNA vector. Examples of genes which may be used for this are auxotrophic genes which are required for biosynthesis of certain metabolites or genes for resistance to a toxin. Auxotrophic genes are only required when the specific metabolite they are required for is not present in the culture media. Resistance genes are only required when the toxin which they provide protection from is present.
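The selection logic described above may be summarized in a short sketch. The following Python function is a simplified, hypothetical model (its name and parameters are assumptions introduced for illustration); it treats survival as a yes/no outcome and ignores effects such as plasmid loss or partial growth.

```python
def cell_survives(has_vector, marker_type,
                  metabolite_in_medium=False, toxin_in_medium=False):
    """Simplified model of selection with a conditionally essential gene.

    marker_type is "auxotrophic" or "resistance". The host is assumed to lack
    the gene, so the only possible source of the gene is the DNA vector.
    """
    if marker_type == "auxotrophic":
        # The gene is needed only when the metabolite is absent from the medium.
        return has_vector or metabolite_in_medium
    if marker_type == "resistance":
        # The gene is needed only when the toxin is present in the medium.
        return has_vector or not toxin_in_medium
    raise ValueError("marker_type must be 'auxotrophic' or 'resistance'")
```

Under this model, selection for the vector occurs only in medium lacking the metabolite (auxotrophic marker) or containing the toxin (resistance marker), since those are the only conditions in which cells without the vector fail to survive.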

Examples of genetically engineered yeast host cells include genetically engineered DHY super-host strains. In some cases, strains are based on the BY4741/BY4742 background (C. B. Brachmann, et al., Yeast 14:115-32, 1998). Strains may also contain any of the following genetic changes from the BY background: sporulation repair (MKT1(30G) RME1(INS-308A) TAO3(1493Q)), and mitochondrial genome stability and function repair (CAT5(91M) MIP1(661T) SAL1+ HAP1+) (see FIG. 12). It should be noted, as would be understood by persons having ordinary skill in the art, that any and all these genetic changes can be performed in isolation, in part, or in totality. For example, it is expected that a single genetic change of either MKT1(30G), RME1(INS-308A), or TAO3(1493Q) would result in at least some repair in sporulation activity. Likewise, a single genetic change of either CAT5(91M) or MIP1(661T) or restoration of function of SAL1 (SAL1+) or HAP1 (HAP1+) would result in at least some increase in mitochondrial genome stability.
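As a non-limiting illustration, the repair alleles listed above can be checked against a written genotype. The Python sketch below assumes a haploid genotype written as whitespace-separated allele names using the notation of this disclosure; the function name and the returned dictionary layout are assumptions made for illustration, and diploid notation (e.g., allele/allele) is not handled.

```python
SPORULATION_REPAIR = ("MKT1(30G)", "RME1(INS-308A)", "TAO3(1493Q)")
MITOCHONDRIAL_REPAIR = ("CAT5(91M)", "MIP1(661T)", "SAL1+", "HAP1+")


def repair_status(genotype):
    """Report which repair alleles appear in a haploid genotype string."""
    alleles = set(genotype.split())
    return {
        "sporulation": [a for a in SPORULATION_REPAIR if a in alleles],
        "mitochondrial": [a for a in MITOCHONDRIAL_REPAIR if a in alleles],
    }


# Genotype of DHY213 as written in Table 3.
dhy213 = ("MATa his3Δ1 leu2Δ0 ura3Δ0 met15Δ0 SAL1+ HAP1+ CAT5(91M) "
          "MIP1(661T) MKT1(30G) RME1(INS-308A) TAO3(1493Q)")
print(repair_status(dhy213))
```

Because each change may be made in isolation, a partial list in either category corresponds to a strain carrying only some of the repairs.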

In some cases, a strain may be an auxotroph. For example, some strains may require methionine, arginine or lysine in the media. In some cases, a strain may be a full heterozygote for several markers from which any combination of markers can be made by tetrad dissection. For example, a strain may be heterozygous for genes required for synthesis of histidine, leucine, uracil, lysine and methionine, or heterozygous for genes required for synthesis of histidine, leucine, uracil, lysine and arginine. Some examples of strains are listed in Table 3.

In some cases, the use of a strain as described herein allows for greater expression of BGC proteins and/or greater production of compounds from the BGCs. In some cases expression of heterologous proteins in a strain described herein is accomplished with an efficiency of at least 70%, 80%, 90%, 100%, 110%, 120%, 130%, 140%, or 150% as compared to heterologous protein expression in BJ5464. In some cases production of heterologous compounds in a strain described herein is accomplished with an efficiency of at least 70%, 80%, 90%, 100%, 110%, 120%, 130%, 140%, or 150% as compared to heterologous compound production in BJ5464.

TABLE 3
Description of strain genotypes

Strain | Parent | Genotype | Reference
BY4741 | S288C | MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0 | (C. B. Brachmann, et al., 1998, cited supra)
BY4743 | S288C | MATa/α his3Δ1/his3Δ1 leu2Δ0/leu2Δ0 LYS2/lys2Δ0 met15Δ0/MET15 ura3Δ0/ura3Δ0 | (C. B. Brachmann, et al., 1998, cited supra)
BJ5464 | | MATα ura3-52 trp1 leu2-Δ1 his3-Δ200 pep4::HIS3 prb1-Δ1.6R can1 GAL | (E. W. Jones, Methods Enzymol. 194:428-53, 1991)
BY4743ΔDNL4 | S288C | MATa/Matα dnl4Δ/dnl4Δ | (E. A. Winzeler, et al., Science 285:901-06, 1999)
Y800 | | MATa ade2-1 leu2-Δ98 ura3-52 lys2-801 trp1-1 his3-Δ200 [cir0] | (N. Burns, et al., Genes Dev. 8:1087-105, 1994)
DHY213 | BY4741 | MATa his3Δ1 leu2Δ0 ura3Δ0 met15Δ0 SAL1+ HAP1+ CAT5(91M) MIP1(661T) MKT1(30G) RME1(INS-308A) TAO3(1493Q) | n/a
JHY693 | DHY213 | MATa his3Δ1 leu2Δ0 ura3Δ0 met15Δ0 SAL1+ HAP1+ CAT5(91M) MIP1(661T) MKT1(30G) RME1(INS-308A) TAO3(1493Q) prb1Δ pep4Δ | n/a
JHY651 | DHY213 | MATα his3Δ1 leu2Δ0 ura3Δ0 met15Δ0 SAL1+ HAP1+ CAT5(91M) MIP1(661T) MKT1(30G) RME1(INS-308A) TAO3(1493Q) prb1Δ pep4Δ lys2Δ0 | n/a
JHY692 | DHY213 | MATa his3Δ1 leu2Δ0 ura3Δ0 met15Δ0 SAL1+ HAP1+ CAT5(91M) MIP1(661T) MKT1(30G) RME1(INS-308A) TAO3(1493Q) prb1Δ pep4Δ ADH2p-npgA-ACS1t | n/a
JHY705 | DHY213 | MATα his3Δ1 leu2Δ0 ura3Δ0 met15Δ0 SAL1+ HAP1+ CAT5(91M) MIP1(661T) MKT1(30G) RME1(INS-308A) TAO3(1493Q) prb1Δ pep4Δ ADH2p-CPR-ACS1t lys2Δ0 | n/a
JHY702 | DHY213 | MATa/MATα his3Δ1/his3Δ1 leu2Δ0/leu2Δ0 ura3Δ0/ura3Δ0 met15Δ0/met15Δ0 SAL1+/SAL1+ HAP1+/HAP1+ CAT5(91M)/CAT5(91M) MIP1(661T)/MIP1(661T) MKT1(30G)/MKT1(30G) RME1(INS-308A)/RME1(INS-308A) TAO3(1493Q)/TAO3(1493Q) prb1Δ/prb1Δ pep4Δ/pep4Δ ADH2p-npgA-ACS1t/ADH2p-CPR-ACS1t met15Δ0/+ lys2Δ0/+ | n/a

Detection and Characterization of Novel Molecules

Once a host cell is expressing the coding sequences of the identified gene cluster, a secondary metabolite may be synthesized in the host cell. The secondary metabolite may be identified by any method known in the art. In some cases, the secondary metabolite is identified by comparing a host cell expressing the cluster with a host cell which does not express the cluster. This comparison may utilize chromatography methods to separate different small molecules produced in the cells, for example, column chromatography, planar chromatography, thin layer chromatography, gas chromatography, liquid chromatography, supercritical fluid chromatography, ion exchange chromatography, or size exclusion chromatography. The comparison may be done by high performance liquid chromatography (HPLC), mass spectrometry (MS), or mass spectrometry high performance liquid chromatography (MS-HPLC). Any peaks which appear for the cluster-expressing host cell and not for the control host cell indicate the presence of a novel chemical. The comparison between the cluster-expressing host cell and the control host cell may comprise a comparison of a cell extract, a culture media, or an extracted cell lysate.
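The comparison of a cluster-expressing host cell against a control can be illustrated with a minimal Python sketch. It assumes each sample has already been reduced to a list of (retention time, m/z) peaks; the function name, the tolerances, and the example peak values are hypothetical and are not taken from this disclosure.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (retention time in minutes, m/z); illustrative representation


def novel_peaks(expressing: List[Peak], control: List[Peak],
                rt_tol: float = 0.1, mz_tol: float = 0.01) -> List[Peak]:
    """Return peaks seen in the cluster-expressing sample but not in the control.

    Two peaks are treated as the same species when both retention time and m/z
    agree within the given (assumed) tolerances.
    """
    def same(p: Peak, q: Peak) -> bool:
        return abs(p[0] - q[0]) <= rt_tol and abs(p[1] - q[1]) <= mz_tol

    return [p for p in expressing if not any(same(p, q) for q in control)]


# Hypothetical peak lists for illustration only
expressing = [(2.10, 181.07), (5.42, 319.15), (7.80, 405.22)]
control = [(2.12, 181.07), (7.79, 405.22)]
print(novel_peaks(expressing, control))
```

Peaks returned by such a comparison would be candidates only; confirming a novel chemical would still rely on the analytical methods described above.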

Compounds Identified

This disclosure also provides sequences of 43 BGCs, and structures of novel products produced by a subset of these BGCs.

In one embodiment, this disclosure provides sequences of cryptic BGCs which encode various products, SEQ ID NOs: 67-483. These BGCs may also be reengineered to provide the coding sequences without the endogenous regulatory sequences. In some examples, the coding sequences may be predicted using known bioinformatics methods or experimental data, or obtained from databases, such as the default predicted gene coordinates (start, stop, and introns) as deposited in GenBank. Once the coding sequences have been identified, the sequences may be isolated and cloned into one or more expression vectors for expression in a model host system such as S. cerevisiae.

The expression vectors may be plasmids, viruses, linear DNA, bacterial artificial chromosomes or yeast artificial chromosomes. Each of the one or more expression vectors may contain one or more promoters suitable for expression of a heterologous gene in a model host system. Each expression vector may contain a single coding sequence or multiple coding sequences. Multiple coding sequences may be functionally linked to a single promoter, for example via an internal ribosome entry site, or may be linked to multiple promoters. The expression vectors may also contain additional elements to regulate or increase the transcriptional activity, for example enhancers, polyA sequences, introns, and posttranscriptional stability elements. The expression vectors may also contain one or more selectable markers.

The expression vectors may be transfected, or otherwise introduced, into host cells. Examples of host cells include but are not limited to yeast and bacterial cells. For example, a host cell may be a S. cerevisiae cell or an E. coli cell. Incubating the expression vectors in the host cells allows for the transcription and translation of the coding sequences to recreate the proteins of the gene cluster. These proteins may then produce a secondary metabolite which can be isolated from the cells or the media in which the cells are grown.

In another embodiment, this disclosure provides host cell extracts containing non-host cell derived products. In some cases, these extracts may be produced by culturing host cells expressing one or more, or all, of the sequences from one of the following groups of SEQ ID NOs: 67-76, 77-81, 82-91, 92-97, 98-106, 107-111, 112-118, 119-127, 128-135, 136-153, 154-157, 158-162, 163-172, 173-181, 182-186, 187-191, 192-199, 200-206, 207-211, 212-224, 225-228, 229-235, 236-240, 241-244, 245-255, 256-267, 268-276, 277-285, 286-289, 290-293, 294-307, 308-313, 314-318, 319-324, 325-329, 330-334, 335-341, 342-350, 351-357, 358-367, 368-372, 373-380, 381-388, 389-395, 396-400, 401-406, 407-413, 414-423, 424-427, 428-439, 440-447, 448-453, 454-462, 463-471, 472-480, or 481-483. In some cases, a host cell may express all the sequences from one of the following groups of SEQ ID NOs: 67-76, 77-81, 82-91, 92-97, 98-106, 107-111, 112-118, 119-127, 128-135, 136-153, 154-157, 158-162, 163-172, 173-181, 182-186, 187-191, 192-199, 200-206, 207-211, 212-224, 225-228, 229-235, 236-240, 241-244, 245-255, 256-267, 268-276, 277-285, 286-289, 290-293, 294-307, 308-313, 314-318, 319-324, 325-329, 330-334, 335-341, 342-350, 351-357, 358-367, 368-372, 373-380, 381-388, 389-395, 396-400, 401-406, 407-413, 414-423, 424-427, 428-439, 440-447, 448-453, 454-462, 463-471, 472-480, or 481-483. In some cases, the host cell(s) may express one or more sequences selected from SEQ ID NOs: 67-483. After culturing the cells, they may be collected and lysed, and the small molecules may be purified from nucleic acid, proteins, complex carbohydrates and lipid containing fractions. The secondary metabolites produced may also be secreted into the cell media. In this case, this disclosure also provides a cell media containing secondary metabolites.

This disclosure also provides compounds isolated from the host cell extracts or media. A compound of this disclosure may be

Compounds of this disclosure may have useful therapeutic applications, for example in treating or preventing a disease or disorder. Compounds of this disclosure may be used to treat an infection, for example a bacterial, fungal or parasitic infection. Compounds of this disclosure may have antibiotic and/or antifungal activities. Compound 8 and Compound 9 may have antimicrobial, antifungal and/or antibacterial activities. Compounds of this disclosure may have non-medical applications.

In some embodiments, this disclosure provides pharmaceutical compositions comprising a compound of this disclosure. In some cases, a pharmaceutical composition contains at least one of Compounds 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16. Compositions as described herein may comprise a liquid formulation, a solid formulation or a combination thereof. Non-limiting examples of formulations may include a tablet, a capsule, a gel, a paste, a liquid solution and a cream. The compositions of the present disclosure may further comprise any number of excipients. Excipients may include any and all solvents, coatings, flavorings, colorings, lubricants, disintegrants, preservatives, sweeteners, binders, diluents, and vehicles (or carriers). Generally, the excipient is compatible with the therapeutic compositions of the present disclosure. Generally, the excipient is a pharmaceutically acceptable excipient. The pharmaceutical composition may also contain minor amounts of non-toxic auxiliary substances such as wetting or emulsifying agents, pH buffering agents, and other substances such as, for example, sodium acetate, and triethanolamine oleate.

In some embodiments, this disclosure provides a method of synthesizing a compound described herein. The method may include steps of providing one or more coding sequences of SEQ ID NOs: 67-483 in a suitable vector, or vectors, together with regulatory sequences which will drive expression of the coding sequences in a host cell. The vector, or vectors, are then provided to a host cell, such as for example a yeast cell, and the cells are grown under conditions that allow for the expression of the coding sequences. In some cases, a host cell may be provided with 1, 2, 3, 4, 5, 6, or more than 6 different plasmids. The synthesized compound may be purified from the cell culture by centrifuging the cells to produce a cell pellet and a supernatant. The supernatant and cell pellet may be extracted using either ethyl acetate or acetone, or other suitable organic solvent. For compounds containing carboxylic acid groups, the pH of the supernatant may be adjusted to a pH of 4 or less (e.g., 3) with an acid such as HCl prior to extraction. After extraction, both organic phases are combined and evaporated to dryness. The compounds may then be dissolved in the desired solvent and further purified using standard purification methods.

Although the present disclosure has been described in certain specific aspects, many additional modifications and variations would be apparent to those skilled in the art. In particular, any of the various processes described above can be performed in alternative sequences in order to achieve similar results in a manner that is more appropriate to the requirements of a specific application. It is therefore to be understood that embodiments of the present disclosure can be practiced otherwise than specifically described without departing from the scope and spirit of the present disclosure. Thus, embodiments of the present disclosure should be considered in all respects as illustrative and not restrictive.

Example 1: Identification of Production Phase Promoters

Biological data support the systems and constructs of production-phase promoter DNA vectors and applications thereof. Provided below are several examples of incorporating production-phase promoters into DNA vectors. Some of these vectors were used to produce biosynthetic products from multi-gene clusters derived from various fungal species. Compared to a constitutive promoter system, production-phase promoter systems in accordance with various embodiments produced several-fold greater product.

Production Phase Promoter Expression Analysis

Because the ADH2 promoter (SEQ ID NO. 1) has properties of a production-phase promoter, a panel of promoter sequences was compared to the ADH2 promoter to identify other production-phase promoters. To begin, endogenous S. cerevisiae genes were identified that appeared co-regulated with ADH2 in a previous genome-wide transcription study (Z. Xu, et al., Nature 457:1033-37, 2009, the disclosure of which is incorporated herein by reference). In this study, transcription of yeast genes was quantified during mid-exponential growth in several types of growth media. Of the 5171 ORFs examined, 35 appeared co-regulated with ADH2, with co-regulation defined as a greater than two-fold increase in expression with a non-fermentable carbon source (ethanol in a yeast-peptone-ethanol (YPE) media) as compared to a fermentable carbon source (dextrose in a yeast-peptone-dextrose (YPD) media). Because these data were collected at a single time point and assessed transcription of genes in their native context, the ability of the corresponding promoters to co-regulate heterologous genes in a production-phase promoter system required further validation and characterization.
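The co-regulation criterion stated above (a greater than two-fold increase in expression in YPE relative to YPD) can be expressed as a simple filter. The following Python sketch is illustrative only: the function name, the dictionary-based input format, and the expression values are assumptions, not data from the cited study.

```python
from typing import Dict, List


def coregulated_orfs(expr_ype: Dict[str, float], expr_ypd: Dict[str, float],
                     min_fold: float = 2.0) -> List[str]:
    """Return ORFs whose YPE/YPD expression ratio exceeds `min_fold`.

    ORFs lacking a positive YPD measurement are skipped because no ratio
    can be computed for them.
    """
    hits = []
    for orf, ype in expr_ype.items():
        ypd = expr_ypd.get(orf)
        if ypd is None or ypd <= 0:
            continue
        if ype / ypd > min_fold:
            hits.append(orf)
    return sorted(hits)


# Hypothetical expression values (arbitrary units) for illustration only
ype = {"ADH2": 120.0, "PCK1": 90.0, "TDH3": 500.0}
ypd = {"ADH2": 2.0, "PCK1": 3.0, "TDH3": 480.0}
print(coregulated_orfs(ype, ypd))
```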

A detailed characterization of the ability of 34 selected promoters to control expression of heterologous genes was performed. For this specific purpose, a promoter was defined as the shorter of (a) 500 bp upstream of the start codon, or (b) the entire 5′ intergenic region. Each promoter was cloned upstream of the gene for monomeric enhanced GFP (eGFP), and each of the resulting cassettes was integrated in a single copy at the ho locus of individual strains. Control strains were included in which the strong constitutive FBA1 and TDH3 promoters were cloned upstream of eGFP in an identical manner. The 35 promoter sequences can be found in SEQ ID NOs. 2-35.
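The promoter definition used here (the shorter of 500 bp upstream of the start codon or the entire 5′ intergenic region) can be sketched as a coordinate calculation. The Python example below is a minimal sketch under stated assumptions: it handles only forward-strand genes, uses 0-based half-open coordinates, and the function and parameter names are hypothetical.

```python
from typing import Tuple


def promoter_region(start_codon_pos: int, upstream_gene_end: int,
                    max_len: int = 500) -> Tuple[int, int]:
    """Return half-open (begin, end) coordinates of the promoter.

    start_codon_pos   -- index of the first base of the start codon
    upstream_gene_end -- index just past the last base of the upstream gene
    The intergenic region is [upstream_gene_end, start_codon_pos).
    """
    intergenic_len = start_codon_pos - upstream_gene_end
    if intergenic_len < 0:
        raise ValueError("upstream gene overlaps the start codon")
    length = min(max_len, intergenic_len)
    return (start_codon_pos - length, start_codon_pos)


print(promoter_region(10000, 9000))  # long intergenic region: capped at 500 bp
print(promoter_region(10000, 9700))  # short intergenic region: used in full
```

A reverse-strand gene would need the mirrored calculation, which is omitted here for brevity.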

In order to compare the 35 putative production-phase promoters, the expression of eGFP protein was assessed over 72 hours in each strain by flow cytometry in media with both fermentable (YPD) and non-fermentable (YPE) carbon sources (FIGS. 16 and 17). All cultures were started in YPD media and analysis of eGFP expression began when cells were in the midst of exponential fermentative growth (OD₆₀₀=0.4, 0 hrs). At this point, cells were either left to continue growth in YPD or spun down and resuspended in YPE. Consistent with previous work, pADH2 was entirely repressed at the point where the experiment commenced (during exponential fermentative growth, 0 hrs), unlike the constitutive promoters pTDH3 and pFBA1, which were expressed at near maximum levels regardless of phase. Moderate expression from pADH2 was observed after a further 6 hours in YPD culture or following a growth media switch to YPE. Within 24 hrs, expression reached levels exceeding those observed in the strong constitutive systems. Cytometry histograms and fluorescence microscopy demonstrated that within 48 hours, >95% of all cells with pADH2 and pPCK1 driven expression were fluorescing above background (FIG. 18). Protein expression levels spanned 15-50 fold, with most promoters showing little or no expression until 24 hours into the culture (FIGS. 16 and 17). Transgene expression driven by the PCK1, MLS1, and ICL1 promoters (SEQ ID NOs. 2-4) not only showed the same timing of expression as pADH2, but also reached an equivalently high level. The promoters of genes YLR307C-A, YGR067C, IDP2, ADY2, GAC1, ECM13 and FAT3 (SEQ ID NOs. 5-11) displayed semi-strong transgene expression (FIG. 16). In addition, the promoters of genes PUT1, NQM1, SFC1, JEN1, SIP18, ATO2, YIG1, and FBP1 (SEQ ID NOs. 12-19) displayed weak transgene expression (FIGS. 16 and 17). The promoter PHO89 (SEQ ID NO. 20) did not exhibit strong repression during the growth phase (FIG. 16, 0 and 6 hours). The results of the other sequences are also depicted in FIG. 16 (SEQ ID NOs. 22-36). The constitutive promoters pTDH3 and pFBA1 (SEQ ID NOs. 50 and 52) were used as controls (FIGS. 16-18).

The above analysis identified a large set of co-regulated promoters spanning a wide range of expression levels, three of which were as strong as pADH2. However, a more extensive set of strong production-phase promoters is desirable for assembly of constructs having multi-gene pathways, especially pathways having more than four genes. To identify other production-phase promoter candidates, the genomes of five closely related species within the Saccharomyces sensu stricto complex were examined (FIG. 19). The promoter region was identified for the closest ADH2 gene homolog in the genomes of Saccharomyces bayanus, Saccharomyces paradoxus, Saccharomyces mikitae, Saccharomyces kudriavzevii, and Saccharomyces castellii. Multiple sequence alignment of the upstream activation sequences (UAS) revealed that nearly all sequences (except that from S. castellii) are highly conserved across this region, suggesting a potential for regulation similar to that of S. cerevisiae ADH2 (FIG. 20, SEQ ID NOs. 36-40). In order to be used for single-step pathway assembly, all promoter sequences must be sufficiently unique to prevent undesired recombination with each other. Therefore, the pairwise identities for each of the Saccharomyces sensu stricto ADH2 promoter pairs were analyzed (FIG. 21). The most similar promoter to the S. cerevisiae ADH2 promoter is that from S. paradoxus, with 83% identity, including a single 40 bp stretch located near the center of the promoter. This homology is significantly less than the 50-100 bp typically used for assembly by yeast homologous recombination, and recombination events between sequences with this level of identity occur at very low frequency, suggesting that these promoters should be compatible with a multi-gene assembly technique utilizing yeast homologous recombination as described above.
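The two quantities discussed above, pairwise identity and the longest shared stretch between two promoters, can be computed from a pairwise alignment. The Python sketch below is illustrative and makes several assumptions: the inputs are already aligned strings of equal length with "-" for gaps, identity is counted over all alignment columns that are not gap-only, and the 40 bp stretch is taken to mean a run of contiguous identical columns. The function name and example sequences are hypothetical.

```python
from typing import Tuple


def identity_and_longest_run(aln_a: str, aln_b: str) -> Tuple[float, int]:
    """Return (percent identity, longest run of identical columns)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = compared = run = longest = 0
    for x, y in zip(aln_a, aln_b):
        if x == "-" and y == "-":
            continue  # gap-only column carries no information
        compared += 1
        if x == y:
            matches += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0  # a mismatch or a gap breaks the identical stretch
    pct = 100.0 * matches / compared if compared else 0.0
    return pct, longest


# Toy aligned sequences for illustration only
print(identity_and_longest_run("ACGTACGT-A", "ACGAACGTTA"))
```

Under these assumptions, a promoter pair would be considered compatible with homologous-recombination assembly when the longest run stays well below the 50-100 bp of homology typically used for that assembly.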

As with the endogenous yeast promoter candidates, these other putative Saccharomyces promoters required detailed characterization of induction profiles. DNA encoding each of these promoter sequences was obtained by commercial synthesis, and expression of eGFP from each promoter was characterized in the same manner as for the endogenous yeast promoters (FIGS. 22 and 23). Of the five Saccharomyces sensu stricto pADH2s tested (SEQ ID NOs. 36-40), the promoters derived from S. paradoxus, S. kudriavzevii, and S. bayanus show timing and strength of expression equivalent to that of S. cerevisiae pADH2. In combination with the endogenous yeast promoters, these three additional Saccharomyces pADH2s expand the number of strong promoters with the desired induction profile.

Expression of Compound Product Pathways Using the Production-Phase Promoter System

To study the utility of the new promoter set for heterologous expression of a biosynthetic system, production of fungal-derived dehydrozearalenol (1) and indole-diterpene (2) was examined (FIG. 24, Compounds 1 & 2). The biosynthesis of the indole-diterpene compound resulted from the coordinated expression of four Aspergillus tubingensis genes (FIG. 25, SEQ ID NOs. 59-62). Two versions of each pathway were constructed: one having all production-phase promoters, and the other having all constitutive promoters (FIG. 24). The production-phase promoter system utilized pADH2 from S. cerevisiae (SEQ ID NO. 1), pADH2 from S. bayanus (SEQ ID NO. 38), and pPCK1 (SEQ ID NO. 2) and pMLS1 (SEQ ID NO. 3) from S. cerevisiae. In the constitutive system, transcription was driven by four frequently used strong constitutive promoters: pTEF1, pFBA1, pPCK1, and pTPI1 (SEQ ID NOs. 51-54). Each indole-diterpene system was constructed on a single plasmid harboring four expression cassettes: promoter::GGPPS::tADH2; promoter::PT::tPGI1; promoter::FMO::tENO2; and promoter::Cyc::tTEF1, wherein the promoter sequences corresponded to either the production-phase or the constitutive promoters (FIG. 24). Similar constructs were built for the dehydrozearalenol compound with the two genes HR-PKS and NR-PKS (SEQ ID NOs. 63 and 64). All plasmids were constructed using yeast homologous recombination. It should be noted that the pADH2 sequences from S. cerevisiae and S. bayanus (61% identity) are sufficiently unique for this type of assembly. The production of compounds 1 and 2 by S. cerevisiae BJ5464/npgA/pRS424 transformed with each of these plasmids was measured over seventy-two hours in YPD batch culture (FIG. 26). An 80-fold increase in the titer of compound 1 and a 4.5-fold increase in the titer of compound 2 were observed for the system using the production-phase promoters as compared to the constitutive system.

Materials and Methods Supporting the Production-Phase Promoter Experiments

General techniques, reagents, and strain information: Restriction enzymes were purchased from New England Biolabs (NEB, Ipswich, MA). Cloning was performed in E. coli DH5α. PCR steps were performed using Q5® high-fidelity polymerase (NEB). Yeast dropout media was purchased from MP Biomedicals (Santa Ana, Calif.) and prepared according to manufacturer specifications. Promoter characterization experiments were performed in BY4741 (MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0), while all experiments involving the production of compound 1 were performed in BJ5464-npgA, which is BJ5464 (MATα ura3-52 his3-Δ200 leu2-Δ1 trp1 pep4::HIS3 prb1-Δ1.6R can1 GAL) with two copies of pADH2-npgA integrated at δ elements. All Gibson assemblies were performed as previously described using 30 bp assembly overhangs.

Construction and characterization of promoter-eGFP reporter strains: All promoters were defined as the shorter of 500 base pairs upstream of a gene's start codon or the entire 5′ intergenic region. All promoters from S. cerevisiae were amplified from genomic DNA, while ADH2 promoters from all Saccharomyces sensu stricto species were ordered as gBlocks from Integrated DNA Technologies (IDT, Coralville, Iowa). Minimal alterations were made to promoters from S. kudriavzevii and S. mikitae in order to meet synthesis specifications. In all constructs, eGFP was cloned directly upstream of the terminator from the CYC1 gene (tCYC1). pRS415 was digested with SacI and SalI and a NotI-eGFP-tCYC1 cassette was inserted by Gibson assembly, generating pCH600. Digestion of pCH600 with AccI and PmlI removed the CEN/ARS origin, which was replaced by 500 bp sequences flanking the ho locus using Gibson assembly to yield plasmid pCH600-HOint. Each of the promoters to be analyzed was amplified with appropriate assembly overhangs and inserted into pCH600-HOint digested with NotI to generate the pCH601 plasmid series. Digestion of the pCH601 plasmid series with AscI generated linear integration cassettes, which were transformed into S. cerevisiae BY4741 by the LiAc/PEG method. Correct integration was confirmed by PCR amplification of promoters and Sanger sequencing.

For characterization, all strains were initially grown to saturation overnight in 100 μl of YPD media. These cells were then reinoculated at an OD₆₀₀ of 0.1 into 1 ml of fresh YPD and allowed to grow to OD₆₀₀=0.4 to reach mid-log phase growth (approximately 6 hrs). 500 μl of each culture was pelleted by centrifugation and resuspended in YPE broth for YPE data, while the remaining 500 μl was used for YPD data. The 0 hour time point was collected immediately after resuspension. For each time point, 10 μl of culture was diluted in 2 ml of DI water and sonicated for three short pulses at 35% output on a Branson Sonifier. Expression data were collected for 10,000 cells using a FACSCalibur flow cytometer (BD Bioscience) with the FL1 detector. Data were analyzed in R using the flowCore package.
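The fraction of cells fluorescing above background, as reported for the cytometry data, can be illustrated with a short calculation. The sketch below is written in Python rather than R and does not use flowCore; it assumes that FL1 intensities have already been exported as plain lists and that background is defined as a nearest-rank percentile of a non-fluorescent control population. The function name, the percentile choice, and the example values are all hypothetical.

```python
import math
from typing import Sequence


def fraction_above_background(sample_fl1: Sequence[float],
                              control_fl1: Sequence[float],
                              percentile: float = 99.0) -> float:
    """Fraction of sample events brighter than the control's percentile cutoff."""
    if not sample_fl1 or not control_fl1:
        raise ValueError("both populations must contain events")
    ranked = sorted(control_fl1)
    # Nearest-rank percentile: smallest value with at least `percentile` % of events at or below it
    k = max(0, math.ceil(percentile * len(ranked) / 100) - 1)
    threshold = ranked[k]
    return sum(v > threshold for v in sample_fl1) / len(sample_fl1)


# Toy intensities for illustration only
control = list(range(1, 101))   # background population, values 1..100
sample = [50, 100, 150, 200]
print(fraction_above_background(sample, control))
```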

Construction of plasmids to produce compounds in S. cerevisiae: The sequences for genes assembled on indole-diterpene (IDT) producing plasmids are contained in the supporting information. Regulatory cassettes of promoters and terminators were fused using overlap extension PCR. All genes and regulatory cassettes were amplified by PCR, ensuring 60 bases of homology between all adjacent fragments. 500 ng of each purified fragment was combined with 100 ng of pRS425 linearized with NotI and transformed into S. cerevisiae BJ5464/npgA. Sixteen clones were picked from each assembly plate and grown to saturation in 5 ml CSM-Leu medium. Plasmids were isolated, transformed into E. coli and purified prior to sequence confirmation using the Illumina MiSeq platform. Detailed plasmid maps for pCHIDT-2.1 and pCHIDT-2c are shown in FIG. 27A, which illustrates the primers used and the assembly strategy (SEQ ID NOs. 65 and 66).

Examining the productivity of indole diterpene generating systems: Plasmids pCHIDT-2.1 and pCHIDT-2c were transformed into BJ5464/npgA with pRS424 as a source of tryptophan overproduction (see, e.g., FIG. 27B). Triplicates of each strain were inoculated into CSM-Leu/-Trp medium and grown overnight (OD₆₀₀=2.5-3.0). Each culture was used to inoculate 20 ml cultures in YPD medium at an OD₆₀₀=0.2 and incubated with shaking at 30° C. for 3 days. Every 24 hrs, 2 ml was sampled from each culture. Supernatants were clarified by centrifugation and extracted with 2 ml ethyl acetate (EtOAc). Cell pellets were extracted with 2 ml 50% EtOAc in acetone. 500 μl each of pellet and supernatant extracts were combined and dried in vacuo. Samples were resuspended in 100 μl HPLC grade methanol and LC-MS analysis was conducted on a Shimadzu LC-MS-2020 liquid chromatography mass spectrometer with a Phenomenex Kinetex C18 reverse-phase column (1.7 μm, 100 Å, 100 mm×2.1 mm) with a linear gradient of 15% to 95% acetonitrile (v/v) in water (0.1% formic acid) over 10 min followed by 95% acetonitrile for 7 min at a flow rate of 0.3 mL/min.
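The LC gradient described above (15% to 95% acetonitrile over 10 min, then 95% for 7 min) can be written as a function of time. This Python sketch only restates the stated program; the function name is hypothetical, and it does not model instrument dwell volume or re-equilibration.

```python
def acetonitrile_percent(t_min: float, start: float = 15.0, end: float = 95.0,
                         ramp_min: float = 10.0, hold_min: float = 7.0) -> float:
    """Return the programmed % acetonitrile (v/v) at time t_min of the method."""
    if t_min < 0 or t_min > ramp_min + hold_min:
        raise ValueError("time is outside the 17 min method")
    if t_min <= ramp_min:
        # Linear ramp from `start` to `end`
        return start + (end - start) * t_min / ramp_min
    return end  # isocratic hold


for t in (0, 5, 10, 17):
    print(t, acetonitrile_percent(t))
```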

Example 2: Identification of Gene Clusters which Produce Compounds that Interact with a Target Protein

Thanks to next-generation sequencing, thousands of bacterial and fungal genomes have been sequenced. These species are known to be rich sources of secondary metabolites, for example penicillin, rapamycin, and the statins. These secondary metabolites are small molecules, enzymatically synthesized by the products of one or more genes, often arrayed contiguously in a “biosynthetic gene cluster”.

This disclosure describes a method for identifying specific biosynthetic gene clusters, where the target of the secondary metabolite is a specific protein, and expressing that secondary metabolite in a host organism.

In certain cases, for example, when a secondary metabolite is being used as a weapon against other organisms, the secondary metabolite may also be toxic to the organism that produces it. In these cases, the producing organism may defend itself against self-harm in a number of ways: by pumping the secondary metabolite out of the cell; by enzymatically negating the secondary metabolite; or by producing an additional version of the target protein that is less sensitive or insensitive to the secondary metabolite.

In those cases where the organism produces an additional version of the target protein, this “protective” version of the gene is often colocalized with the biosynthetic gene cluster. Although different from the gene that produces the target protein, the protective version should maintain detectable homology to the target protein. This method takes advantage of this homology to identify those biosynthetic gene clusters that contain or are adjacent to a protective homolog of the target protein.

The input data required for this method are a list of biosynthetic gene clusters (e.g., polyketide synthase clusters, non-ribosomal peptide synthetase clusters) and a list of target proteins (i.e., proteins whose activity is to be modulated with secondary metabolites). The biosynthetic clusters may be identified based on the presence of certain protein domains, for example by the software program antiSMASH. The target proteins may be chosen based on their quantitative likelihood of being drug targets.

The output of this method is a score for each biosynthetic cluster, where clusters with higher scores represent those that are more likely to produce secondary metabolites targeting specific proteins of interest.

A score was constructed for each biosynthetic cluster based on the following factors:

1. the presence of one or more homologs of a target protein within or adjacent to the cluster, as determined by a homology search (e.g., using the tblastn algorithm, with a maximum score granted when one homolog is found)
2. the confidence in homology of the target to genes in a cluster (e.g., according to the tblastn algorithm, with an increasing score for lower e-values, and an upper bound threshold of 1e-30)
3. the fraction of the homologous gene that meets a certain threshold of identity (e.g., with an increasing score for more identity, and a lower bound threshold of 25% identity)
4. the total number of genes homologous to the target protein present in the entire genome of the organism (e.g., with a maximum score granted to cases with 2-4 homologs per genome)
5. the homology of the gene in or adjacent to the cluster to the target protein (e.g., using the blastx algorithm, with a maximum score granted when the gene in the biosynthetic gene cluster's closest homolog in the target protein's genome is the target protein itself)
6. the phylogenetic relationship of the target protein to the gene in the cluster (e.g., with an increasing score for homologs in the gene cluster that clade with the target protein, with confidence assigned by a bootstrap test or Bayesian inference of phylogeny, and a lower bound threshold defined as homologs in a phylogenetic context that appear in a clade with bootstrap value of 0.7 or Bayesian posterior probability of 0.8)
7. the expected number of homologs of the target in or adjacent to the biosynthetic cluster (e.g., with a greater score the lower the probability of a homolog of the target being present in or adjacent to a biosynthetic cluster of a certain size, given the number of total homologs in the genome, as determined by a permutation test)
8. the likelihood that the target protein is essential for viability, growth, or other cellular processes in the native environment, e.g., through evidence that deletion of homologs in related organisms (such as S. cerevisiae) renders the organism inviable
9. synteny of the gene cluster with related species (e.g., with a maximum score if the entire cluster, including the target homolog, is conserved across several species)
10. the functional class of the target homolog (e.g., with a greater score if the gene is in a protein complex already known to be targeted by secondary metabolites)
11. the presence of specific promoters adjacent to the target homolog (e.g., with a greater score when there is a bidirectional promoter upstream of the target homolog and a biosynthetic gene)
12. the presence of specific regulatory elements in the biosynthetic gene cluster (e.g., with a greater score when there is a transcription factor binding site that is shared between target genes and/or biosynthetic genes in the cluster)
13. the presence of target homologs outside the cluster (including on other chromosomes) that are co-regulated with some or all of the genes in the biosynthetic cluster (e.g., with a greater score when biosynthetic gene clusters are co-regulated with putative target homologs)
14. the presence of protein- and DNA-sequence-derived features within the clusters that have successfully been shown to produce secondary metabolites (e.g., with a greater score when a gene in a cluster shares a domain, as determined by a Hidden Markov Model (HMM), with a cluster that has produced a secondary metabolite in one of the host organisms)
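How several of the listed factors might be combined into a single cluster score can be sketched in Python. This is not the disclosed algorithm: only factors 2, 3, and 4 are modeled, the per-factor scaling functions and the equal weights are hypothetical, and identity is simplified to a single fraction for the homolog. The thresholds (1e-30 e-value, 25% identity, 2-4 homologs per genome) are taken from the list above.

```python
import math
from typing import Dict


def evalue_score(evalue: float, threshold: float = 1e-30) -> float:
    """Zero above the e-value threshold; grows as the e-value shrinks (assumed scaling, capped at 1)."""
    if evalue > threshold:
        return 0.0
    if evalue <= 0:
        return 1.0
    return min(1.0, -math.log10(evalue) / 100.0)


def identity_score(fraction: float, lower: float = 0.25) -> float:
    """Zero below 25% identity, then increases linearly to 1 at 100% (assumed scaling)."""
    if fraction < lower:
        return 0.0
    return (fraction - lower) / (1.0 - lower)


def genome_count_score(n_homologs: int) -> float:
    """Maximum score for 2-4 homologs per genome; the reduced value 0.5 is hypothetical."""
    if n_homologs <= 0:
        return 0.0
    return 1.0 if 2 <= n_homologs <= 4 else 0.5


WEIGHTS: Dict[str, float] = {"evalue": 1.0, "identity": 1.0, "genome_count": 1.0}  # hypothetical


def cluster_score(evalue: float, identity: float, genome_homologs: int,
                  weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the modeled factor scores."""
    parts = {
        "evalue": evalue_score(evalue),
        "identity": identity_score(identity),
        "genome_count": genome_count_score(genome_homologs),
    }
    return sum(weights[name] * value for name, value in parts.items())


print(cluster_score(1e-50, 0.625, 3))
```

In such a design, the weights are the natural place to apply calibration against clusters with known targets.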

The above score was calibrated with reference to a set of “true positives” (i.e., cases where there are one or more known targets in or adjacent to a biosynthetic gene cluster that produces a small molecule known to target that protein).

This algorithm has been programmed in the Python programming language and has been applied to a set of more than 1,000 fungal genomes (and more than 10,000 biosynthetic gene clusters) to produce a list of potentially relevant biosynthetic clusters.
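The permutation test mentioned in factor 7 of the scoring list can be illustrated with a simplified Python sketch. It is not the disclosed implementation: it assumes a genome modeled as a single linear list of genes, homologs placed uniformly at random, and a cluster window (cluster plus adjacent genes) of fixed size. Under that model the window position does not affect the result, so it is fixed at the start of the list. All names and parameter values are hypothetical.

```python
import random


def homolog_in_cluster_pvalue(n_genes: int, n_homologs: int, window: int,
                              n_perm: int = 10000, seed: int = 0) -> float:
    """Estimate the chance that at least one homolog falls inside the cluster window."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign homolog positions among all genes
        positions = rng.sample(range(n_genes), n_homologs)
        if any(p < window for p in positions):
            hits += 1
    return hits / n_perm


# Toy genome: 100 genes, 3 homologs, 10-gene window
print(homolog_in_cluster_pvalue(100, 3, 10))
```

A low estimated probability would correspond to a higher score for that factor, since a colocalized homolog is then less likely to be present by chance.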

Expressing Biosynthetic Clusters in a Host Organism

Given the highest scoring biosynthetic clusters as defined by the algorithm above, DNA was synthesized for each of the genes in those clusters. The DNA was cloned into a host organism (e.g., S. cerevisiae, also known as baker's yeast) for expression. The host organism synthesized the proteins from the gene cluster, which produce a secondary metabolite. Using HPLC and mass spectrometry, the secondary metabolite-expressing strain can be compared to an unmodified strain to affirm the presence of a new secondary metabolite.

This method has been successfully applied to the production of several secondary metabolites where there is evidence, based on the above method, for what the target protein of the secondary metabolite should be:

- A secondary metabolite derived from a biosynthetic gene cluster containing a homolog of the human gene SOS1;
- A secondary metabolite derived from a biosynthetic gene cluster containing a homolog of the human gene BRSK1; and
- A secondary metabolite derived from a biosynthetic gene cluster containing a homolog of the human gene DDX41.

A further example of a gene cluster which produces a product, for which there is evidence suggesting the target, is shown in FIG. 15A.

Example 3: Prioritization of Novel Biosynthetic Gene Clusters by Phylogenetic Analysis

Two classes of fungal BGCs, those with either a polyketide synthase (PKS) or a UbiA-type sesquiterpene cyclase (UTC) as their core enzyme, were chosen for analysis.

A computational pipeline was developed to prioritize PKS and UTC containing BGCs for heterologous expression. 581 sequenced fungal genomes were analyzed from the publicly available GenBank database of the National Center for Biotechnology Information (NCBI, as of July 2015). Each genome was analyzed for BGCs using antiSMASH2, identifying 3512 BGCs harboring an iterative type 1 PKS (iPKS) and 326 BGCs harboring a UTC homologue. Phylogenetic trees of each of these enzyme types were generated with identified characterized homologs from the MIBiG database. BGCs were primarily selected from clades having few characterized members (FIG. 28A, FIG. 29A). The selected BGCs were found in the genomes of both ascomycetes and basidiomycetes. Basidiomycetes are, in general, more difficult to culture, with fewer tools for genetic manipulation available as compared to ascomycetes. As a result, BGCs from basidiomycetes are under-studied, with few PKS-containing clusters deposited in MIBiG, suggesting that these organisms represent a reservoir of BGCs capable of producing compounds with interesting new structures.

The coding sequences of all BGCs were ordered as synthetic constructs according to the default predicted gene coordinates (start, stop, and introns) as deposited in GenBank; the clusters are described in Table 4.

Shown in FIG. 28A is a cladogram of the ketosynthase sequences of the 3512 iPKS sequences identified in this study. Of these, 28 were selected, and the associated BGC containing each selected iPKS was analyzed using heterologous expression. Selected BGCs met the following criteria: (a) genetic structure was conserved across 3 or more species, (b) the PKS exhibited canonical domain architecture, and (c) the cluster contained an in-cis or proximal in-trans protein capable of releasing the polyketide from the carrier protein of the PKS (FIG. 30A). Seven of these clusters were derived from distinct clades composed entirely of sequences from basidiomycetes (FIG. 28A).
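By way of illustration only, selection criteria (a) through (c) above can be expressed as a simple filter over candidate clusters. The following Python sketch is not the pipeline used in this study; the class name CandidateBGC, its field names, and the example records are hypothetical and are introduced solely to show how the three criteria combine.

```python
from dataclasses import dataclass


@dataclass
class CandidateBGC:
    """Hypothetical record summarizing one candidate iPKS-containing BGC."""
    cluster_id: str
    conserved_in_species: int    # species sharing the same genetic structure
    canonical_domains: bool      # PKS shows canonical domain architecture
    has_release_protein: bool    # in-cis or proximal in-trans releasing protein


def meets_selection_criteria(bgc: CandidateBGC, min_species: int = 3) -> bool:
    """Return True only if criteria (a), (b), and (c) are all satisfied."""
    return (
        bgc.conserved_in_species >= min_species   # criterion (a)
        and bgc.canonical_domains                 # criterion (b)
        and bgc.has_release_protein               # criterion (c)
    )


# Invented example records, for illustration only.
candidates = [
    CandidateBGC("example_A", 4, True, True),
    CandidateBGC("example_B", 2, True, True),    # fails criterion (a)
    CandidateBGC("example_C", 5, True, False),   # fails criterion (c)
]

selected = [b.cluster_id for b in candidates if meets_selection_criteria(b)]
print(selected)
```

In practice, each field would be populated from comparative genomics and domain annotation rather than entered by hand.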

The 28 selected PKS clusters were edited according to the methods described here to form expression vectors suitable for expression of the cluster coding sequences in yeast cells. The host cells were incubated and analyzed for the presence of novel chemical compounds by HPLC, as described in the methods section below. Of the PKS clusters selected from ascomycetes, 13 produced compounds. The most notable is the PKS1 cluster, which contains only an iPKS, a hydrolase, and the genes for three tailoring enzymes: a cytochrome P450 (P450), a flavin-dependent monooxygenase (FMO), and a short-chain dehydrogenase/reductase (SDR).

For the study of fungal UTCs, the phylogenetic tree shown in FIG. 29A was constructed based on the UbiA-type sesquiterpene cyclase, Fma-TC, from the fumagillin biosynthetic pathway. Moreover, the P450, Fma-P450, from the same pathway was shown to be a powerful enzyme catalyzing the 8-electron oxidation of bergamotene to generate a highly oxygenated product. UTC BGCs spanning the entirety of the cladogram in FIG. 29A were selected where a cytochrome P450 was proximal to the UTC gene (FIG. 30B). Ultimately, 13 UTC BGCs from both ascomycetes and basidiomycetes were selected for analysis.

Screening of strains expressing these clusters by LC/HRMS revealed novel spectral features consistent with oxidized sesquiterpenoids being produced by five clusters (FIG. 29A). These results demonstrate that the membrane-bound UTCs represent a general class of terpene cyclase encoded by the genomes of diverse fungi. Several clusters and compounds produced are shown in FIGS. 28B, 28C, 28D, 28E and 28F.

Including both PKS and UTC BGCs, 22 of the 41 clusters produced measurable compounds; see Table 4 for a summary of the type, species of origin, and productivity of the clusters. Gene annotation errors introduced by incorrect intron prediction may have contributed to this failure rate. Manual inspection of one UTC (TC5) that initially had yielded no products suggested an incorrect intron prediction at the 5′ terminus of the gene. Correction of this intron led to a C-terminal protein sequence that aligned well with known functional UTCs. When tested by heterologous expression in a host cell, the version with the corrected intron produced a compound, confirming that incorrect intron prediction is a failure mode in approaches that rely on publicly available gene annotations (FIG. 29B). These results illustrate the importance of careful gene curation and the need for improved eukaryotic gene prediction, particularly with sequences from taxa with few well-studied members.

The results summarized in Table 4 demonstrate the utility of the methods herein for the selection of cryptic fungal BGCs. With the tools developed here, strains were built expressing 41 such clusters with 22 (54%) producing detectable levels of products not native to S. cerevisiae. While both basidiomycetes and ascomycetes are known to be prolific producers of bioactive compounds, to date, the bulk of research on the biosynthesis of fungal natural products has been undertaken in ascomycetes. In this study, heterologous expression allowed a large-scale survey of cryptic fungal BGCs from both ascomycetes and basidiomycetes, a less studied and more difficult to culture division of fungi with fewer tools for genetic manipulation. Using this platform, a panel of new products produced by the selected PKS and UTC clusters was identified.
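The productivity figure above can be checked against the "Productive?" column of Table 4. The short Python sketch below is offered only as a worked arithmetic example; the flag strings are transcribed from Table 4 in cluster order (PKS1 to PKS28, then TC1 to TC13, excluding the two control clusters), and the variable and function names are illustrative.

```python
# "Productive?" flags transcribed from Table 4, in cluster order.
# The two control clusters (IDT and DHZ) are deliberately excluded.
pks_flags = "YYNYNYNYNYNNYYYYYYNYNYYYNNNY"   # PKS1 .. PKS28
utc_flags = "YNYYYNNNYNNNN"                  # TC1 .. TC13


def tally(flags: str) -> tuple:
    """Return (number of productive clusters, number of clusters tested)."""
    return flags.count("Y"), len(flags)


pks_hits, pks_total = tally(pks_flags)
utc_hits, utc_total = tally(utc_flags)
hits = pks_hits + utc_hits
total = pks_total + utc_total
percent = 100 * hits / total

print(f"PKS {pks_hits}/{pks_total}, UTC {utc_hits}/{utc_total}, "
      f"overall {hits}/{total} ({percent:.0f}%)")
```

Counting the two productive control clusters as well gives the totals listed at the foot of Table 4 (43 clusters, 24 productive).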

Methods

antiSMASH2 software was applied to 581 public fungal genomes deposited in the GenBank database of the National Center for Biotechnology Information (NCBI), to search for type 1 PKS and UbiA-like terpene cyclase gene clusters. This analysis identified 3,512 type 1 PKS gene clusters and 326 UbiA-like terpene gene clusters in 538 fungal genomes.

Phylogenetic analysis of both sequence sets was performed by building multiple sequence alignments of all protein sequences using MAFFT and building phylogenetic trees as shown in FIG. 28A and FIG. 29A using FastTree 2.
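A minimal sketch of how the alignment and tree-building steps might be chained is given below, assuming that the mafft and FastTree executables are installed and on the PATH. The disclosure does not state which options were used; the MAFFT `--auto` flag, the FastTree defaults for protein alignments, the executable names, and the file names are all assumptions made for illustration.

```python
import shutil
import subprocess


def build_commands(fasta_in: str, aligned_out: str, tree_out: str):
    """Return (command, output_path) pairs for alignment and tree building.

    Both tools write their result to standard output, so each command is
    paired with the file that should capture it.
    """
    mafft_cmd = ["mafft", "--auto", fasta_in]   # option choice is an assumption
    fasttree_cmd = ["FastTree", aligned_out]    # protein alignment is the default input
    return [(mafft_cmd, aligned_out), (fasttree_cmd, tree_out)]


def run_pipeline(fasta_in: str, aligned_out: str, tree_out: str) -> None:
    """Run MAFFT, then FastTree, redirecting stdout to the output files."""
    for cmd, out_path in build_commands(fasta_in, aligned_out, tree_out):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
        with open(out_path, "w") as handle:
            subprocess.run(cmd, stdout=handle, check=True)


# Show the commands without executing them (file names are hypothetical).
for cmd, out_path in build_commands("ks_domains.fasta", "ks_domains.aln", "ks_domains.nwk"):
    print(" ".join(cmd), ">", out_path)
```

Calling `run_pipeline("ks_domains.fasta", "ks_domains.aln", "ks_domains.nwk")` would execute both steps; the resulting Newick file can then be rendered as a cladogram in any tree viewer.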

28 of the 3,512 sequenced type 1 PKS gene clusters and 13 of the 326 terpene gene clusters were selected for expression in yeast as described above.

Construction and Culture of Production Strains:

Production strains were constructed by transforming plasmid DNA isolated out of E. coli (Qiagen miniprep 27106) into the appropriate expression host (JHY692 for PKS-containing plasmids, JHY705 for all others) using the Frozen-EZ Yeast Transformation II kit (Zymo Research T2001) followed by plating on the appropriate SDC dropout media (CSM-Leu for PKS-containing plasmids, CSM-Ura for all others). For BGCs encoded on at least two plasmids, three biological replicates for each haploid transformant were mated on YPD plates and incubated at 30° C. for 4-16 hrs prior to streaking for single colonies on CSM-Ura/-Leu and incubated at 30° C.

Small-scale cultures for analysis were begun by picking three biological replicates of each production strain along with empty vector controls into 500 μL of the appropriate SDC dropout medium in a 1 mL deep-well block and grown for approximately 24 hrs at 30° C. 50 μL of overnight culture was used to inoculate 500 μL of each of the production media to be tested in the experiment (generally both YPD and YPEG) in 1 mL deep-well blocks. All blocks were covered with gas-permeable plate seals (Thermo Scientific AB-0718) and incubated at 30° C. for 72 hrs with shaking at 1000 rpm. Supernatants were clarified by centrifugation for 20 mins at 2800 g and a minimum of 100 μL of clarified supernatant was stored for future analysis. The remainder of the supernatant was discarded and the cell pellets extracted by mixing with 400 μL of 1:1 ethyl acetate:acetone. Cell debris was precipitated by centrifugation for 20 mins at 2800 g and 200 μL of the extraction solvent pipetted to a fresh block and evaporated in a speedvac.

Prior to analysis, all supernatants were passed through a 0.2 μm filter plate while all cell pellet extracts were resuspended in 200 μL of HPLC grade methanol prior to filtering.

Analysis of Small Scale Cultures:

LC-MS analysis was conducted on an Agilent 6545 quadrupole time-of-flight mass spectrometer interfaced to an Agilent 1290 HPLC system. The ion source for most analyses was an electrospray ionization source (dual-inlet Agilent Jet Stream or “dual AJS”). In some analyses, an Agilent Multimode Ion Source was also used for atmospheric pressure chemical ionization. The parameters used for both ionization sources are outlined in Table 5.

The HPLC column for all analyses was a 50 mm×2.1 mm Zorbax RRHD Eclipse C18 column with 1.8 μm beads (Agilent, 959757-902). No guard column was used.

Gradient conditions were isocratic at 95% A from 0 to 0.2 min, with a gradient from 95% A to 5% A from 0.2 to 4.2 minutes, followed by isocratic conditions at 5% A from 4.2 to 5.2 minutes, followed by a gradient from 5% A to 95% A from 5.2 to 5.2 minutes, followed by isocratic re-equilibration at 95% A from 5.2 to 6 minutes. For electrospray analyses, A was 0.1% v/v formic acid in water and B was 0.1% v/v formic acid in acetonitrile. For APCI analyses, B was substituted by 0.1% v/v formic acid in methanol.
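For clarity, the gradient program above can be written as a table of time points and interpolated linearly between them. The Python sketch below is illustrative only: the names GRADIENT_PROGRAM and percent_a are hypothetical, and the change from 5% A to 95% A, which is stated to begin and end at 5.2 minutes, is treated here as an instantaneous step (an interpretation, not a statement from the disclosure).

```python
# (time in minutes, percent mobile phase A) pairs taken from the text above.
GRADIENT_PROGRAM = [
    (0.0, 95.0),   # isocratic hold
    (0.2, 95.0),   # start of ramp down
    (4.2, 5.0),    # end of ramp down
    (5.2, 5.0),    # end of isocratic hold at 5% A
    (5.2, 95.0),   # return to starting conditions (zero-length segment)
    (6.0, 95.0),   # end of re-equilibration
]


def percent_a(t_min: float) -> float:
    """Return the programmed percent A at time t_min by linear interpolation.

    Zero-length segments are skipped, so at exactly 5.2 min the value
    before the step (5% A) is reported.
    """
    start, end = GRADIENT_PROGRAM[0][0], GRADIENT_PROGRAM[-1][0]
    if not start <= t_min <= end:
        raise ValueError(f"time {t_min} min is outside the {start}-{end} min program")
    for (t0, a0), (t1, a1) in zip(GRADIENT_PROGRAM, GRADIENT_PROGRAM[1:]):
        if t1 == t0:
            continue
        if t0 <= t_min <= t1:
            return a0 + (a1 - a0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: program covers the full time range")


for t in (0.1, 2.2, 5.0, 5.6):
    a = percent_a(t)
    print(f"t = {t:.1f} min: {a:.1f}% A, {100 - a:.1f}% B")
```

The percent B at any time is simply 100 minus percent A, as printed in the usage loop.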

Data analysis by untargeted metabolomics was performed with xcms, using optimal parameters determined by IPO²⁵. For PKS-containing clusters, automated analyses were set to generate extracted ion chromatograms (EICs) for the top 50 spectral features as defined by both fold-change and p-value. These EICs were then manually inspected to identify the subset of automatically identified features that appear specific to the expressed BGC as defined by presence in each of three biological replicates of the production strain and absence from three biological replicates of a negative control strain (FIG. 31). EICs of all BGC-specific features are illustrated in FIGS. 32-49.
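The specificity rule described above (presence in all three production-strain replicates and absence from all three negative-control replicates) was applied by manual inspection in this study. Purely as an illustration of the rule, it could be approximated in code as follows. The Feature class, the ranking key, the detection_limit parameter, and the example values are hypothetical; in particular, how fold-change and p-value were combined to define the top 50 features is not stated, so the sort order below is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Feature:
    """Hypothetical summary of one spectral feature from untargeted analysis."""
    name: str
    fold_change: float
    p_value: float
    strain_areas: Tuple[float, float, float]    # three production-strain replicates
    control_areas: Tuple[float, float, float]   # three negative-control replicates


def top_features(features: List[Feature], n: int = 50) -> List[Feature]:
    """Rank by descending fold-change, then ascending p-value (an assumption)."""
    return sorted(features, key=lambda f: (-f.fold_change, f.p_value))[:n]


def is_bgc_specific(feature: Feature, detection_limit: float = 0.0) -> bool:
    """Present in every production replicate and absent from every control."""
    present = all(area > detection_limit for area in feature.strain_areas)
    absent = all(area <= detection_limit for area in feature.control_areas)
    return present and absent


# Invented peak areas, for illustration only.
features = [
    Feature("feature_1", 120.0, 1e-6, (5.1e5, 4.8e5, 5.5e5), (0.0, 0.0, 0.0)),
    Feature("feature_2", 15.0, 1e-3, (2.0e5, 0.0, 1.9e5), (0.0, 0.0, 0.0)),    # missing in one replicate
    Feature("feature_3", 8.0, 2e-2, (9.0e4, 8.5e4, 9.2e4), (4.0e4, 0.0, 0.0)), # seen in a control
]

specific = [f.name for f in top_features(features) if is_bgc_specific(f)]
print(specific)
```

An automated filter of this kind would only pre-screen candidates; inspection of the EICs themselves remains the criterion described in the text.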

Example 4: Construction of Yeast Strains

In the current example, yeast strains are based on the BY4741/BY4742 background, which is in turn based on S288c (C. B. Brachmann, et al., 1998, cited supra). The strains were made in two stages: 1) creation of a core DHY set with restored sporulation and mitochondrial genome stability and 2) creation of JHY derivatives modified for other benefits, which may include protein production. All changes introduced in this study were confirmed by diagnostic PCR and sequencing.

A sporulation-restored strain set was built by crossing BY4710 (C. B. Brachmann, et al., 1998, cited supra) to a haploid derivative of YAD373 (A. M. Deutschbauer and R. W. Davis, Nat. Genet. 37:133-40, 2005), a BY-based diploid that contains three QTLs that restore sporulation: MKT1(30G), RME1(INS-308A), and TAO3(1493Q). A spore clone from the resulting diploid was repaired for HAP1, which encodes a zinc-finger transcription factor localized to mitochondria and the nucleus. HAP1 is important for mitochondrial genome stability (see J. R. Matoon, E. Caravajal, and D. Gurthrie Curr. Genet. 17:179-83, 1990) and likely also important for sporulation. S288c and derivatives contain a Ty1 insertion in the 3′ end of HAP1 that inactivates function. The transposon was excised using the Delitto Perfetto method (F. Storici and M. A. Resnick, Methods Enzymol. 409:329-45, 2006), and repaired HAP1 function was confirmed based on transcription of a CYC1p-lacZ reporter (M. Gaisne, et al., Curr. Genet. 36:195-200, 1999). The sporulation-restored, HAP1-repaired strain and its auxotrophic and prototrophic derivatives were then used to create the DHY set of strains that were additionally restored for mitochondrial genome stability.

The above sporulation-restored strains were used to repair the poor mitochondrial genome stability known to be a problem with S288c and BY derivatives. Mitochondrial genome stability is likely to improve growth and ADH2p-like gene expression under conditions of respiration, and to reduce the frequency of petite cells (slow-growing, respiration-defective cells that cannot grow on non-fermentable carbon sources). For a detailed description of the “mito-repair” method, see construction of JHY650 (J. D. Smith, 2017, cited supra). Briefly, the 50:50 genome editing method was used to introduce the wild-type alleles of three genes shown to be important for mitochondrial genome stability by QTL analysis³¹. The repaired QTLs are: SAL1⁺ (repair of a frameshift), CAT5(91M) and MIP1(661T). Crosses with prototrophic and auxotrophic strains completed the DHY core set of about a dozen sporulation and mitochondrial genome stability restored strains that can be further modified as needed. DHY213 (see Table 3) is one such strain: it contains the seven desired changes described above, is otherwise congenic with BY4741, and was used in this study to create derivatives for the HEx platform (see Table 3).

Marker-free, seamless deletion of the complete PRB1 and PEP4 ORFs was performed using the 50:50 method (J. Horecka and R. W. Davis, 2014, cited supra). Integration of a 1609 bp ADH2p-npgA-ACS1t expression cassette on the chromosome was performed using a method similar to that used to integrate DNA segments with the REDI method (J. D. Smith, et al., 2017, cited supra), except that URA3, not FCY1, was used as the counter-selectable marker. For an integration site, an 1166 bp cluster of three transposon LTRs located centromere-distal to YBR209W on chromosome II was replaced (deletion of chrII 643438 to 644603). Two DNA segments were simultaneously inserted via homologous recombination at the integration site that had been cut with SceI to create double strand breaks. One inserted segment was ADH2p-npgA (1448 bp) PCR amplified from a BJ5464/npgA expression strain (npgA from A. nidulans) (K. K. M. Lee, N. A. Da Silva, and J. T. Kealey, Anal. Biochem., 394:75-80, 2009). The npgA 3′ end was repaired to wildtype using a reverse PCR primer that replaced the npgA intron included previously with the wildtype npgA 3′ sequence. To preclude recombination of the expression cassette with the native ADH2 locus, the 161 bp ACS1 terminator was used as the second DNA segment (not ADH2t) and PCR amplified from BY4741. The resulting strain (JHY692) was used in a similar fashion to replace only npgA with the CPR ORF (cytochrome P450 reductase, ATEG_05064 from A. terreus). Finally, a strain with both npgA and CPR expression cassettes (JHY702) was created by mating JHY692 and JHY705.
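The segment lengths quoted above are mutually consistent, as the following short arithmetic check shows. The variable names are illustrative, and the chromosome II coordinates are assumed to be inclusive at both ends.

```python
# Lengths of the two inserted DNA segments, in base pairs (from the text).
adh2p_npga_bp = 1448
acs1_terminator_bp = 161
cassette_bp = adh2p_npga_bp + acs1_terminator_bp

# Deleted LTR cluster on chromosome II; coordinates assumed inclusive.
deletion_start, deletion_end = 643438, 644603
deleted_bp = deletion_end - deletion_start + 1

print(f"ADH2p-npgA-ACS1t cassette: {cassette_bp} bp")
print(f"Deleted LTR cluster: {deleted_bp} bp")
```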

Example 5: Determination of Chemical Structures

For compound isolation, large-scale fermentation was carried out with the strains and clusters of Example 3. The yeast strains were first struck out onto the appropriate SDC dropout agar plates and incubated for 48 hrs at 30° C. A colony was then inoculated into 40 mL SDC dropout medium and incubated at 28° C. for two days with shaking at 250 rpm. This seed culture was used to inoculate 4 L of YPD medium (1.5% Glucose) and cultured for 3 days at 28° C. and 250 rpm. Supernatants were then clarified by centrifugation and extracted with an equal volume of ethyl acetate. Cell pellets were extracted with 1 L of acetone. For compounds containing carboxylic acid groups, the pH value of the supernatant was adjusted to 3 by adding HCl prior to extraction. The organic phases were combined and evaporated to dryness. The residue was purified by ISCO-CombiFlash® Rf 200 (Teledyne Isco, Inc) with a gradient of hexane and acetone. After analysis by LC-MS, the fractions containing the target compounds were combined and further purified by semi-preparative HPLC using a C18 reverse-phase column. The purity of each compound was confirmed by LC-MS, and the structure was solved by NMR (FIGS. 49-59).

All NMR spectra including ¹H, ¹³C, COSY, HSQC, HMBC and NOESY spectra were obtained on a Bruker AV500 spectrometer with a 5 mm dual cryoprobe at the UCLA Molecular Instrumentation Center. The NMR solvents used for these experiments were purchased from Cambridge Isotope Laboratories, Inc.

TABLE 4 Summary of control and cryptic fungal BGCs examined in this study.
Cluster ID | Type | Native Locus: GenBank ID | Native Locus: Start | Native Locus: End | Native Locus: Length | Species of origin | Division | Productive?
IDT | Ctl |  |  |  |  | Aspergillus tubingensis | Ascomycota | Y
DHZ | Ctl |  |  |  |  | Hypomyces subiculosus | Ascomycota | Y
PKS1 | PKS | KV441552 | 530394 | 546723 | 16329 | Coniothyrium sporulosum | Ascomycota | Y
PKS2 | PKS | KV441551 | 60641 | 83767 | 23126 | Coniothyrium sporulosum | Ascomycota | Y
PKS3 | PKS | Deposition pending |  |  | 9892 | Acremonium sp. KY4917 | Ascomycota | N
PKS4 | PKS | AM270992 | 578654 | 603367 | 24713 | Aspergillus niger | Ascomycota | Y
PKS5 | PKS | CP003009 | 8730831 | 8753560 | 22729 | Thielavia terrestris | Ascomycota | N
PKS6 | PKS | ABDF02000052 | 9685 | 28459 | 18774 | Trichoderma virens | Ascomycota | Y
PKS7 | PKS | JPJY01000093 | 2671 | 30026 | 27355 | Pseudogymnoascus pannorum | Ascomycota | N
PKS8 | PKS | JOWA01000110 | 1607857 | 1643205 | 35348 | Scedosporium apiospermum | Ascomycota | Y
PKS9 | PKS | KE384750 | 270098 | 292755 | 22657 | Metarhizium anisopliae | Ascomycota | N
PKS10 | PKS | KB445572 | 91380 | 115199 | 23819 | Cochliobolus heterostrophus | Ascomycota | Y
PKS11 | PKS | JPKB01001000 | 12749 | 37136 | 24387 | Pseudogymnoascus pannorum | Ascomycota | N
PKS12 | PKS | JPJU01000852 | 23274 | 38823 | 15549 | Pseudogymnoascus pannorum | Ascomycota | N
PKS13 | PKS | JPJR01000396 | 848 | 19416 | 18568 | Pseudogymnoascus pannorum | Ascomycota | Y
PKS14 | PKS | KN847553 | 300827 | 326489 | 25662 | Verruconis gallopava | Ascomycota | Y
PKS15 | PKS | AWSO01000045 | 179446 | 203577 | 24131 | Moniliophthora roreri | Basidiomycota | Y
PKS16 | PKS | JH687542 | 431482 | 465332 | 33850 | Punctularia strigosozonata | Basidiomycota | Y
PKS17 | PKS | KN839868 | 143119 | 188013 | 44894 | Hydnomerulius pinastri | Basidiomycota | Y
PKS18 | PKS | DS989828 | 1389449 | 1407499 | 18050 | Arthroderma gypseum | Ascomycota | Y
PKS19 | PKS | KB908593 | 409171 | 443069 | 33898 | Setosphaeria turcica | Ascomycota | N
PKS20 | PKS | GL532685 | 4724 | 17845 | 13121 | Pyrenophora teres | Ascomycota | Y
PKS21 | PKS | AMGW01000002 | 1294251 | 1322017 | 27766 | Cladophialophora yegresii | Ascomycota | N
PKS22 | PKS | DF933843 | 225991 | 254307 | 28316 | Talaromyces cellulolyticus | Ascomycota | Y
PKS23 | PKS | KE720645 | 48853 | 80314 | 31461 | Endocarpon pusillum | Ascomycota | Y
PKS24 | PKS | DF933834 | 523551 | 558018 | 34467 | Talaromyces cellulolyticus | Ascomycota | Y
PKS25 | PKS | AWSO01000633 | 5071 | 30440 | 25369 | Moniliophthora roreri | Basidiomycota | N
PKS26 | PKS | KN817529 | 130968 | 152240 | 21272 | Hypholoma sublateritium | Basidiomycota | N
PKS27 | PKS | KB445800 | 1362248 | 1403385 | 41137 | Ceriporiopsis subvermispora | Basidiomycota | N
PKS28 | PKS | AWSO01000632 | 8804 | 37857 | 29053 | Moniliophthora roreri | Basidiomycota | Y
TC1 | UTC | ABDF02000086 | 384012 | 368868 | 15144 | Trichoderma virens | Ascomycota | Y
TC2 | UTC | ABDF02000083 | 37740 | 49247 | 11507 | Trichoderma virens | Ascomycota | N
TC3 | UTC | FQ790293 | 97750 | 131296 | 33546 | Botryotonia cinerea | Ascomycota | Y
TC4 | UTC | JH717969 | 858521 | 869226 | 10705 | Fomitiporia mediterranea | Basidiomycota | Y
TC5 | UTC | KI925459 | 2412992 | 2432365 | 19373 | Heterobasidion annosum | Basidiomycota | Y
TC6 | UTC | KB445798 | 581049 | 618823 | 37774 | Gelatoporia subvermispora | Basidiomycota | N
TC7 | UTC | JH719450 | 83620 | 114270 | 30650 | Dichomitus squalens | Basidiomycota | N
TC8 | UTC | KU198014 | 1718438 | 1746540 | 28102 | Pleurotus ostreatus | Basidiomycota | N
TC9 | UTC | GL377319 | 205808 | 212935 | 7127 | Schizophyllum commune | Basidiomycota | Y
TC10 | UTC | JH687394 | 99692 | 114281 | 14589 | Stereum hirsutum | Basidiomycota | N
TC11 | UTC | JH687396 | 33983 | 48346 | 14363 | Stereum hirsutum | Basidiomycota | N
TC12 | UTC | JH719415 | 280135 | 300470 | 20335 | Dichomitus squalens | Basidiomycota | N
TC13 | UTC | JH795868 | 73132 | 97338 | 24206 | Dacryopinax primogenitus | Basidiomycota | N
Total: 43 | Productive: 24

TABLE 5 Ion source parameters used in this study
Ion source parameter | Dual AJS | MMI
gas temperature | 250° C. | 350° C.
drying gas | 12 L/min | 7.5 L/min
nebulizer | 10 psig | 20 psig
sheath gas temp. | 400° C. | -
sheath gas flow | 12 L/min | -
vaporizer | - | 250° C.
capillary voltage (Vcap) | 3500 V | 1500 V
nozzle voltage | 1400 V | -
corona discharge | - | 4 μA
fragmentor | 100 V | 120 V
skimmer | 50 V | 50 V
octopole 1 RF Vpp | 750 V | 750 V
charging voltage | - | 1000 V

TABLE 6 Description of promoter sequences
SEQ ID NO. | Description
1 | S. cerevisiae pADH2
2 | S. cerevisiae pPCK1
3 | S. cerevisiae pMLS1
4 | S. cerevisiae pICL1
5 | S. cerevisiae pYLR307C-A
6 | S. cerevisiae pYGR067C
7 | S. cerevisiae pIDP2
8 | S. cerevisiae pADY2
9 | S. cerevisiae pGAC1
10 | S. cerevisiae pECM13
11 | S. cerevisiae pFAT3
12 | S. cerevisiae pPUT1
13 | S. cerevisiae pNQM1
14 | S. cerevisiae pSFC1
15 | S. cerevisiae pJEN1
16 | S. cerevisiae pSIP18
17 | S. cerevisiae pATO2
18 | S. cerevisiae pYIG1
19 | S. cerevisiae pFBP1
20 | S. cerevisiae PHO89
21 | S. cerevisiae CAT2
22 | S. cerevisiae CTA1
23 | S. cerevisiae ICL2
24 | S. cerevisiae ACS1
25 | S. cerevisiae PDH1
26 | S. cerevisiae REG2
27 | S. cerevisiae CIT3
28 | S. cerevisiae CFRC1
29 | S. cerevisiae RGI2
30 | S. cerevisiae PUT4
31 | S. cerevisiae NCA3
32 | S. cerevisiae STL1
33 | S. cerevisiae ALP1
34 | S. cerevisiae NDE2
35 | S. cerevisiae QNQ1
36 | S. paradoxus pADH2
37 | S. kudriavzevii pADH2
38 | S. bayanus pADH2
39 | S. mikatae pADH2
40 | S. castellii pADH2
41 | S. paradoxus pPCK1
42 | S. kudriavzevii pPCK1
43 | S. bayanus pPCK1
44 | S. paradoxus pMLS1
45 | S. kudriavzevii pMLS1
46 | S. bayanus pMLS1
47 | S. paradoxus pICL1
48 | S. kudriavzevii pICL1
49 | S. bayanus pICL1
50 | S. cerevisiae pTDH3
51 | S. cerevisiae pTEF1
52 | S. cerevisiae pFBA1
53 | S. cerevisiae pPDC1
54 | S. cerevisiae pTPI1
55 | S. cerevisiae tADH2
56 | S. cerevisiae tPGI1
57 | S. cerevisiae tENO2
58 | S. cerevisiae tTEF1
59 | A. tubingensis GGPPS
60 | A. tubingensis PT
61 | A. tubingensis FMO
62 | A. tubingensis Cyc
63 | H. subiculosus hpm8
64 | H. subiculosus hpm3
65 | pCHIDT-2.1
66 | pCHIDT-2c

TABLE 7 Description of gene sequences
Cluster ID | SEQ ID NO. | Description
AFU3G | 67 | AFOC1
AFU3G | 68 | AFOC9
AFU3G | 69 | AFOC6
AFU3G | 70 | AFOC5
AFU3G | 71 | AFOC8
AFU3G | 72 | AFOC4
AFU3G | 73 | AFOC7
AFU3G | 74 | AFOC5N
AFU3G | 75 | AFOC2_PKS
AFU3G | 76 | AFOC3
Afu1g17740 | 77 | A1OC1_TF
Afu1g17740 | 78 | A1OC2_serine_hydrolase
Afu1g17740 | 79 | A1OC3_aldose_epimerase
Afu1g17740 | 80 | A1OC4_P450
Afu1g17740 | 81 | Afu1g17740_2_PKS
Ca157 | 82 | Ca157_1_SDR
Ca157 | 83 | Ca157_2_Acyl_CoA_oxidase
Ca157 | 84 | Ca157_3_P450
Ca157 | 85 | Ca157_4_FMO
Ca157 | 86 | Ca157_5_PKS
Ca157 | 87 | Ca157_6_transferase
Ca157 | 88 | Ca157_7_PfpI
Ca157 | 89 | Ca157_8_hyp
Ca157 | 90 | Ca157_9_AB_hydrolase
Ca157 | 91 | Ca157_10_hyp
Ca2032 | 92 | Ca2032_1_3_GHMP_kinase
Ca2032 | 93 | Ca2032_1_PKS
Ca2032 | 94 | Ca2032_3_MT
Ca2032 | 95 | Ca2032_4_P450
Ca2032 | 96 | Ca2032_5_Ca_uniporter
Ca2032 | 97 | Ca2032_6_GNAT_acetyltransferase
KU14 | 98 | KU14_SC3_4774_Cyclase
KU14 | 99 | KU14_SC3_4776_P450
KU14 | 100 | KU14_SC3_4773_polyprenyl_synthetase
KU14 | 101 | KU14_SC3_4771_AK_reductase
KU14 | 102 | KU14_SC3_4770_polyprenyl_synthetase
KU14 | 103 | KU14_SC3_4768_AK_reductase
KU14 | 104 | KU14_SC3_4772_DNA_repair
KU14 | 105 | KU14_SC3_47775_QacA_drug_transporter
KU14 | 106 | KU14_SC3_4769_Hyp
KU18 | 107 | KU18_SH9_7287_Cyclase
KU18 | 108 | KU18_SH9_7288_P450
KU18 | 109 | KU18_SH9_7289_P450
KU18 | 110 | KU18_SH9_7286_P450
KU18 | 111 | KU18_SH9_7285_DH
KU26 | 112 | KU26_ETS81063.1_metallo_hydrolase
KU26 | 113 | KU26_ETS81064.1_hyp
KU26 | 114 | KU26_ETS81065.1_esterase
KU26 | 115 | KU26_ETS81066.1_PKS
KU26 | 116 | KU26_ETS81067.1_hyp
KU26 | 117 | KU26_ETS81068.1_SDR
KU26 | 118 | KU26_ETS81069.1_SDR
KU29 | 119 | KU29_BO23_6166_ADH
KU29 | 120 | KU29_BO23_6167_aminotransferase_3
KU29 | 121 | KU29_BO23_6168_fasciclin
KU29 | 122 | KU29_BO23_6169_hyp
KU29 | 123 | KU29_BO23_6170_esterase
KU29 | 124 | KU29_BO23_6171_glycoside_hydrolase
KU29 | 125 | KU29_BO23_6172_hyp
KU29 | 126 | KU29_BO23_6173_P450
KU29 | 127 | KU29_BO23_6174_PKS
KU40 | 128 | KU40_KFA56048.1_NmrA_like
KU40 | 129 | KU40_KFA56160.1_hyp
KU40 | 130 | KU40_KFA56046.1_phytanoly-CoA_dioxygenase
KU40 | 131 | KU40_KFA56190.1_NmrA_like
KU40 | 132 | KU40_KFA56035.1_PKS
KU40 | 133 | KU40_KFA56227.1_hyp
KU40 | 134 | KU40_KFA56040.1_hyp
KU40 | 135 | KU40_KFA56229.1_alkaline_serine_protease
KU41 | 136 | KU41_KIL85236.1_phenylalanine_specific_permease
KU41 | 137 | KU41_KIL85237.1_aldehyde_DH
KU41 | 138 | KU41_KIL85238.1_DH
KU41 | 139 | KU41_KIL85239.1_hyp
KU41 | 140 | KU41_KIL85240.1_hyp
KU41 | 141 | KU41_KIL85241.1_hyp
KU41 | 142 | KU41_KIL85242.1_1-amino_cyclopropane-1-carboxylate_oxidase
KU41 | 143 | KU41_KIL85243.1_hyp
KU41 | 144 | KU41_KIL85244.1_PKS
KU41 | 145 | KU41_KIL85245.1_hyp
KU41 | 146 | KU41_KIL85246.1_aromatic_dioxygenase
KU41 | 147 | KU41_KIL85247.1_peptidase
KU41 | 148 | KU41_KIL85248.1_kinase
KU41 | 149 | KU41_KIL85249.1_metalloprotease
KU41 | 150 | KU41_KIL85250.1_BLA
KU41 | 151 | KU41_KIL85251.1_SDR
KU41 | 152 | KU41_KIL85252.1_AK_reductase
KU41 | 153 | KU41_KIL85253.1_metallopeptidase
KU44 | 154 | KU44_SL06_4460_TPR_repeat
KU44 | 155 | KU44_SL06_4461_PKS
KU44 | 156 | KU44_SL06_4462_aminotransferase_V
KU44 | 157 | KU44_SL06_4463_MFS_transporter
PKS1 | 158 | CS100GC1_CDS1_PKS
PKS1 | 159 | CS100GC1_CDS2_serine_hydrolase
PKS1 | 160 | CS100GC1_CDS3_p450
PKS1 | 161 | CS100GC1_CDS4_short_chain_dehydrogenase
PKS1 | 162 | CS100GC1_CDS5_FAD_dehydrogenase
PKS10 | 163 | KU34_CH10_1770_epimerase_DH
PKS10 | 164 | KU34_CH10_1802_P450
PKS10 | 165 | KU34_CH10_1821_PKS
PKS10 | 166 | KU34_CH10_1888_metallo_bla
PKS10 | 167 | KU34_CH10_1909_hyp
PKS10 | 168 | KU34_CH10_1922_DUF1772
PKS10 | 169 | KU34_CH10_1937_NTF2_like
PKS10 | 170 | KU34_CH10_1951_NmrA_like
PKS10 | 171 | KU34_CH10_1975_SDR
PKS10 | 172 | KU34_CH10_2002_ABC_transporter
PKS11 | 173 | KU35_KFY73936.1_SDR
PKS11 | 174 | KU35_KFY73937.1_DUF_3425
PKS11 | 175 | KU35_KFY73938.1_Zn_finger
PKS11 | 176 | KU35_KFY73939.1_OMT
PKS11 | 177 | KU35_KFY73940.1_metallo_lactamase
PKS11 | 178 | KU35_KFY73941.1_PKS
PKS11 | 179 | KU35_KFY73942.1_acetyl_transferase
PKS11 | 180 | KU35_KFY73943.1_NmrA_like
PKS11 | 181 | KU35_KFY73944.1_FAD_linked_oxygenase
PKS12 | 182 | KU36_KFY14209.1_hyp
PKS12 | 183 | KU36_KFY14210.1_OMT
PKS12 | 184 | KU36_KFY14211.1_Metallo_BLA
PKS12 | 185 | KU36_KFY14212.1_PKS
PKS12 | 186 | KU36_KFY14213.1_FAD_linked_oxidase
PKS13 | 187 | KU37_KFY01907.1_DH
PKS13 | 188 | KU37_KFY01908.1_ABC_tranporter
PKS13 | 189 | KU37_KFY01909.1_PKS
PKS13 | 190 | KU37_KFY01910.1_MFS
PKS13 | 191 | KU37_KFY01911.1_PKc_like
PKS14 | 192 | KU38_KIW01747.1_metallo_hydrolase
PKS14 | 193 | KU38_KIW01748.1_halogenase
PKS14 | 194 | KU38_KIW01749.1_P450
PKS14 | 195 | KU38_KIW01750.1_cupin_like
PKS14 | 196 | KU38_KIW01751.1_OMT
PKS14 | 197 | KU38_KIW01752.1_MHR_TF
PKS14 | 198 | KU38_KIW01753.1_monoxygenase
PKS14 | 199 | KU38_KIW01754.1_PKS
PKS15 | 200 | KU39_ESK96608.1_monocarboxylate_permease
PKS15 | 201 | KU39_ESK96609.1_NAD_epimerase_DH
PKS15 | 202 | KU39_ESK96610.1_hyp
PKS15 | 203 | KU39_ESK96611.1_carbonylreductase
PKS15 | 204 | KU39_ESK96612.1_phenol_2_monoxygenase
PKS15 | 205 | KU39_ESK96613.1_PKS
PKS15 | 206 | KU39_ESK96614.1_SDR
PKS16 | 207 | NW_006767437_—_cystathionine_beta-synthase_CDS
PKS16 | 208 | NW_006767437_—_hypothetical_protein_1_CDS
PKS16 | 209 | NW_006767437_—_hypothetical_protein_2_CDS
PKS16 | 210 | NW_006767437_—_hypothetical_protein_3_CDS
PKS16 | 211 | NW_006767437_—_Pkinase-domain-containing_protein_CDS
PKS17 | 212 | KU43_KIJ60838.1
PKS17 | 213 | KU43_KIJ60843.1_P450
PKS17 | 214 | KU43_KIJ60845.1
PKS17 | 215 | KU43_KIJ60839.1_isoprenylcysteine_carboxyl_methyltransferase
PKS17 | 216 | KU43_KIJ60847.1_isoprenylcysteine_carboxyl_methyltransferase
PKS17 | 217 | KU43_KIJ60848.1_ABC_transporter
PKS17 | 218 | KU43_KIJ60844.1
PKS17 | 219 | KU43_KIJ60846.1_P450
PKS17 | 220 | KU43_KIJ60842.1
PKS17 | 221 | KU43_KIJ60840.1_ABC_transporter
PKS17 | 222 | KU43_KIJ60837.1_P450
PKS17 | 223 | KU43_KIJ60841.1_P450
PKS17 | 224 | KU43_KIJ60886.1_PKS
PKS18 | 225 | SU62_EFR04826.1_hypothetical_protein
PKS18 | 226 | SU62_EFR04827.1_fatty-acid-CoA_ligase
PKS18 | 227 | SU62_EFR04828.1_pks
PKS18 | 228 | SU62_EFR04829.1_esterase
PKS19 | 229 | SU64_EOA86426.1_YfhR_like
PKS19 | 230 | SU64_EOA86427.1_ER
PKS19 | 231 | SU64_EOA86425.1_FAD-hydroxylase
PKS19 | 232 | SU64_EOA86421.1_Drug_resistance_transporter
PKS19 | 233 | SU64_EOA86423.1_OMT
PKS19 | 234 | SU64_EOA86422.1_pks
PKS19 | 235 | SU64_EOA86424.1_esterase
PKS2 | 236 | CS163GC1_CDS1_serine_hydrolase
PKS2 | 237 | CS163GC1_CDS2_p450
PKS2 | 238 | CS163GC1_CDS3_PKS
PKS2 | 239 | CS163GC1_CDS4_short_chain_dehydrogenase
PKS2 | 240 | CS163GC1_CDS_FAD_dehydrogenase
PKS20 | 241 | SU65_EFQ95559.1_BLA
PKS20 | 242 | SU65_EFQ95560.1_PKS
PKS20 | 243 | SU65_EFQ95561.1_hyp
PKS20 | 244 | SU65_EFQ95562.1_KR
PKS21 | 245 | SU67_EXJ61964.1_drug_resistance_transporter
PKS21 | 246 | SU67_EXJ61965.1_scytalone_dehydratase
PKS21 | 247 | SU67_EXJ61966.1_versicolorin_reductase
PKS21 | 248 | SU67_EXJ61967.1_AMP_ligase
PKS21 | 249 | SU67_EXJ61968.1_hypothetical_protein
PKS21 | 250 | SU67_EXJ61969.1_FAD_monooxygenase
PKS21 | 251 | SU67_EXJ61970.1_pks
PKS21 | 252 | SU67_EXJ61971.1_metallo_BLA
PKS21 | 253 | SU67_EXJ61972.1_hyp
PKS21 | 254 | SU67_EXJ61973.1_SDR
PKS21 | 255 | SU67_EXJ61974.1_AifR_reg
PKS22 | 256 | SU68_GAM43180.1_beta-lactamase_family_protein
PKS22 | 257 | SU68_GAM43181.1_NAD-dependent_epimerase_dehydratase
PKS22 | 258 | SU68_GAM43183.1_oxidoreductase
PKS22 | 259 | SU68_GAM43185.1_benzoate_4-monooxygenase_cytochrome_P450
PKS22 | 260 | SU68_GAM43187.1_scytalone_dehydratase
PKS22 | 261 | SU68_GAM43176.1_riboflavin_biosynthesis_protein
PKS22 | 262 | SU68_GAM43177.1_sugar_transport_protein
PKS22 | 263 | SU68_GAM43178.1_short-chain_dehydrogenase
PKS22 | 264 | SU68_GAM43179.1_pks
PKS22 | 265 | SU68_GAM43182.1_halogenase
PKS22 | 266 | SU68_GAM43184.1_NAD-dependent_epimerase_dehydratase
PKS22 | 267 | SU68_GAM43186.1_SDR
PKS23 | 268 | SU70_ERF77218.1_SDR
PKS23 | 269 | SU70_ERF77219.1_Fungal_TF
PKS23 | 270 | SU70_ERF77220.1_p450
PKS23 | 271 | SU70_ERF77221.1_pks
PKS23 | 272 | SU70_ERF77222.1_DABB_superfamily
PKS23 | 273 | SU70_ERF77223.1_FAD_monooxygenase
PKS23 | 274 | SU70_ERF77224.1_alcohol_DH
PKS23 | 275 | SU70_ERF77225.1_OMT
PKS23 | 276 | SU70_ERF77226.1_alcohol_DH
PKS24 | 277 | SU71_GAM40295.1_sulfhydrolase
PKS24 | 278 | SU71_GAM40298.1_AMP_ligase
PKS24 | 279 | SU71_GAM40299.1_MT
PKS24 | 280 | SU71_GAM40301.1_carnosine_synthase
PKS24 | 281 | SU71_GAM40303.1_carnitine_acetyl-CoA_transferase
PKS24 | 282 | SU71_GAM40296.1_aminotransferase
PKS24 | 283 | SU71_GAM40297.1_ammonia_lyase
PKS24 | 284 | SU71_GAM40300.1_pks
PKS24 | 285 | SU71_GAM40302.1_ABC_tranporter
PKS25 | 286 | SU72_ESK88623.1_pks
PKS25 | 287 | SU72_ESK88624.1_carotenoid_cleavage_dioxygenase_1
PKS25 | 288 | SU72_ESK88625.1_long-chain-fatty-acid-ligase
PKS25 | 289 | SU72_ESK88626.1_amino_acid_permease
PKS26 | 290 | SU73_KN817529.1_pks
PKS26 | 291 | SU73_KN817529.1_hyp
PKS26 | 292 | SU73_KN817529.1_AMP_ligase
PKS26 | 293 | SU73_KN817529.1_8_amino_7_oxanoate_synthase
PKS27 | 294 | SU74_EMD35673.1_NAD_DH
PKS27 | 295 | SU74_EMD35676.1_sulfhydrylase
PKS27 | 296 | SU74_EMD35664.1_hyp
PKS27 | 297 | SU74_EMD35669.1_halogenase
PKS27 | 298 | SU74_EMD35663.1_AB_hydrolase
PKS27 | 299 | SU74_EMD35666.1_alcohol_DH
PKS27 | 300 | SU74_EMD35667.1_Drug_resistance_transporter
PKS27 | 301 | SU74_EMD35668.1_hyp
PKS27 | 302 | SU74_EMD35670.1_P450
PKS27 | 303 | SU74_EMD35665.1_hyp
PKS27 | 304 | SU74_EMD35671.1_pks
PKS27 | 305 | SU74_EMD35672.1_hyp
PKS27 | 306 | SU74_EMD35674.1_hyp
PKS27 | 307 | SU74_EMD35675.1_hyp
PKS28 | 308 | SU75_ESK88629.1_pks
PKS28 | 309 | SU75_ESK88630.1_drug_resistance_subfamily
PKS28 | 310 | SU75_ESK88631.1_hypothetical_protein
PKS28 | 311 | SU75_ESK88632.1_dead-box_protein_abstrakt
PKS28 | 312 | SU75_ESK88633.1_hyp
PKS28 | 313 | SU75_ESK88634.1_nadh-ubiquinone_oxidoreductase
PKS3 | 314 | AK_C24701GC76_CDS1_class_II_aminotransferase
PKS3 | 315 | AK_C24701GC76_CDS2_p450
PKS3 | 316 | AK_C24701GC76_CDS3_PKS
PKS3 | 317 | AK_C24701GC76_CDS4_ferric_chelate_reductase
PKS3 | 318 | AK_C24701GC76_CDS5_DUF4243
PKS4 | 319 | KU27_AN22_1464_esterase
PKS4 | 320 | KU27_AN22_1465_PKS
PKS4 | 321 | KU27_AN22_1466_hyp
PKS4 | 322 | KU27_AN22_1467_A_TD
PKS4 | 323 | KU27_AN22_1468_hyp
PKS4 | 324 | KU27_AN22_1469_amino_oxidase
PKS5 | 325 | KU28_TT08_2721_AK_reductase
PKS5 | 326 | KU28_TT08_2722_ABC_transporter
PKS5 | 327 | KU28_TT08_2723_esterase
PKS5 | 328 | KU28_TT08_2724_PKS
PKS5 | 329 | KU28_TT08_2725_sugar_transport
PKS6 | 330 | KU30_TV43_5580_esterase
PKS6 | 331 | KU30_TV43_5581_P450
PKS6 | 332 | KU30_TV43_5582_PKS
PKS6 | 333 | KU30_TV43_5583_OMT
PKS6 | 334 | KU30_TV43_5584_P450
PKS7 | 335 | KU31_KFY69032.1_P450
PKS7 | 336 | KU31_KFY69033.1_hyp
PKS7 | 337 | KU31_KFY69034.1_esterase
PKS7 | 338 | KU31_KFY69035.1_P450
PKS7 | 339 | KU31_KFY69036.1_PKS
PKS7 | 340 | KU31_KFY69037.1_glycoside_hydrolase
PKS7 | 341 | KU31_KFY69038.1_thymine_dioxygenase
PKS8 | 342 | KU32_KEZ41287.1_nucleotidyltransferase
PKS8 | 343 | KU32_KEZ41288.1_OMT
PKS8 | 344 | KU32_KEZ41289.1_AB_hydrolase
PKS8 | 345 | KU32_KEZ41290.1_mito_phos_carrier
PKS8 | 346 | KU32_KEZ41291.1_hyp
PKS8 | 347 | KU32_KEZ41292.1_crotonyl_CoA_reductase
PKS8 | 348 | KU32_KEZ41293.1_PKS
PKS8 | 349 | KU32_KEZ41294.1_Drug_resistance_transporter
PKS8 | 350 | KU32_KEZ41295.1_NADB_monoxygenase
PKS9 | 351 | KU33_KJK75348.1_alkaline_serine_hydrolase
PKS9 | 352 | KU33_KJK75349.1_LysM_containing_protein
PKS9 | 353 | KU33_KJK75350.1_PKS
PKS9 | 354 | KU33_KJK75351.1_drug_resistance_transporter
PKS9 | 355 | KU33_KJK75352.1_phytanoyl_CoA_dioxygenase
PKS9 | 356 | KU33_KJK75353.1_SDR
PKS9 | 357 | KU33_KJK75354.1_amino-7_oxonolate_synthase
SU61 | 358 | SU61_ENH82084.1_choline_oxidase
SU61 | 359 | SU61_ENH82085.1_fungal_specific_transcription_factor
SU61 | 360 | SU61_ENH82086.1_short-chain_dehydrogenase
SU61 | 361 | SU61_ENH82087.1_hypothetical_protein
SU61 | 362 | SU61_ENH82088.1_hypothetical_protein
SU61 | 363 | SU61_ENH82089.1_pks
SU61 | 364 | SU61_ENH82090.1_esterase
SU61 | 365 | SU61_ENH82091.1_ABC_multidrug_transporter_mdr1
SU61 | 366 | SU61_ENH82092.1_cytochrome_b5_type_b
SU61 | 367 | SU61_ENH82093.1_sulfite_reductase_subunit_alpha
SU63 | 368 | SU63_KJZ74253.1_hyp_signalling_protein
SU63 | 369 | SU63_KJZ74254.1_mtRNA_formyl_transferase
SU63 | 370 | SU63_KJZ74255.1_esterase
SU63 | 371 | SU63_KJZ74256.1_PKS
SU63 | 372 | SU63_KJZ74257.1_choline_DH_halogenase
SU66 | 373 | SU66_EMF17384.1_ras-domain-containing_protein
SU66 | 374 | SU66_EMF17385.1_phosphoinositide_phosphatase
SU66 | 375 | SU66_EMF17387.1_hypothetical_protein
SU66 | 376 | SU66_EMF17389.1_versicolorin_reductase
SU66 | 377 | SU66_EMF17390.1_scytalone_dehydratase
SU66 | 378 | SU66_EMF17386.1_pks
SU66 | 379 | SU66_EMF17388.1_Metallo-hydrolase_oxidoreductase
SU66 | 380 | SU66_EMF17391.1_StcQ-like_protein
SU69 | 381 | SU69_KFY04761.1_aldehyde_oxidase
SU69 | 382 | SU69_KFY04762.1_SDR
SU69 | 383 | SU69_KFY04763.1_hyp
SU69 | 384 | SU69_KFY04764.1_reg_protein
SU69 | 385 | SU69_KFY04765.1_OMT
SU69 | 386 | SU69_KFY04766.1_metallo_BLA
SU69 | 387 | SU69_KFY04767.1_pks
SU69 | 388 | SU69_KFY04768.1_FAD-linked_oxidase
TC1 | 389 | Tv86_130_CDS
TC1 | 390 | Tv86_132_CDS
TC1 | 391 | Tv86_133_CDS
TC1 | 392 | Tv86_134_CDS
TC1 | 393 | Tv86_135_ORF
TC1 | 394 | Tv86_136_CDS
TC1 | 395 | Tv86_137_CDS
TC10 | 396 | KU19_SH16_10821_Cyclase
TC10 | 397 | KU19_SH16_10820_P450
TC10 | 398 | KU19_SH16_10822_P450
TC10 | 399 | KU19_SH16_10823_AK_reductase
TC10 | 400 | KU19_SH16_10819_QacA_MFS
TC11 | 401 | KU_20_SH18_11663_Cyclase
TC11 | 402 | KU_20_SH18_11664_P450
TC11 | 403 | KU_20_SH18_11665_P450
TC11 | 404 | KU_20_SH18_11662_P450
TC11 | 405 | KU_20_SH18_11660_NMT
TC11 | 406 | KU_20_SH18_11661_MFS
TC12 | 407 | KU21_DS19_7112_Cyclase
TC12 | 408 | KU21_DS19_7113_P450
TC12 | 409 | KU21_DS19_7108_P450
TC12 | 410 | KU21_DS19_7109_MT
TC12 | 411 | KU21_DS19_7111_GMC_oxido
TC12 | 412 | KU21_DS19_7114_AK_reductase
TC12 | 413 | KU21_DS19_7110_Hyp
TC13 | 414 | KU22_DDJM14_6568_Cyclase
TC13 | 415 | KU22_DDJM14_6570_P450
TC13 | 416 | KU22_DDJM14_6567_OMT
TC13 | 417 | KU22_DDJM14_6565_MT
TC13 | 418 | KU22_DDJM14_6564_MT
TC13 | 419 | KU22_DDJM14_6562_P450
TC13 | 420 | KU22_DDJM14_6563_MT
TC13 | 421 | KU22_DDJM14_6561_GST
TC13 | 422 | KU22_DDJM14_6569_V-type_ATPase
TC13 | 423 | KU22_DDJM14_6566_TF
TC2 | 424 | Tv83_13_CDS
TC2 | 425 | Tv83_14_CDS
TC2 | 426 | Tv83_15_CDS
TC2 | 427 | Tv83_16_CDS
TC3 | 428 | BFT4_1_FAD_binding_domain_protein
TC3 | 429 | BFT4_2_Aldo_keto_reductase_oxidoreductase
TC3 | 430 | BFT4_3_cytochrome_P450
TC3 | 431 | BFT4_4_UbiA_cyclase
TC3 | 432 | BFT4_6_hyp_protein_4
TC3 | 433 | BFT4_7_hyp_protein_5
TC3 | 434 | BFT4_8_hyp_protein_6
TC3 | 435 | BFT4_10_FAD_FMN_isoamyl_alcohol_oxidase
TC3 | 436 | BFT4_11_hyp_protein_8
TC3 | 437 | BFT4_14_hyp_protein_11
TC3 | 438 | BFT4_15_glutathione_S_transferase
TC3 | 439 | BFT4_16_D_isomer_specific_2_hydroxyacid_dehydrogenase
TC4 | 440 | KU11_FM3_3034_Cyclase
TC4 | 441 | KU11_FM3_3033_P450
TC4 | 442 | KU11_FM3_3032_P450
TC4 | 443 | KU11_FM3_3031_FAD
TC4 | 444 | KU11_FM3_3037_Hydrox
TC4 | 445 | KU11_FM3_3027_AMP_SDR
TC4 | 446 | KU11_FM3_3030_PHOS
TC4 | 447 | KU11_FM3_3035_Hyp
TC5 | 448 | KU12_HI6_11661_Cyclase
TC5 | 449 | KU12_HI6_11655_P450
TC5 | 450 | KU12_HI6_11638_P450
TC5 | 451 | KU12_HI6_11667_AMP_ligase
TC5 | 452 | KU12_HI6_11646_monocarbox_MFS
TC5 | 453 | KU12_HI6_11632_MFS
TC6 | 454 | KU13_CeS8_5906_Cyclase
TC6 | 455 | KU13_CeS8_5905_P450
TC6 | 456 | KU13_CeS8_5911_P450
TC6 | 457 | KU13_CeS8_5904_choline_DH
TC6 | 458 | KU13_CeS8_5910_halogenase
TC6 | 459 | KU13_CeS8_5907_AMP_ligase
TC6 | 460 | KU13_CeS8_5909_superoxide_dismutase
TC6 | 461 | KU13_CeS8_5908_MFS
TC6 | 462 | KU13_CeS8_5912_MFS
TC7 | 463 | KU15_DS54_10337_Cyclase
TC7 | 464 | KU15_DS54_10336_P450
TC7 | 465 | KU15_DS54_10340_P450
TC7 | 466 | KU15_DS54_10341_P450
TC7 | 467 | KU15_DS54_10338_polyprenyl_synthetase
TC7 | 468 | KU15_DS54_10333_GMC_oxido
TC7 | 469 | KU15_DS54_10334_GMC_oxido
TC7 | 470 | KU15_DS54_10339_Hyp
TC7 | 471 | KU15_DS54_10335_Hyp_transmembrane
TC8 | 472 | KU16_PO11_11845_Cyclase
TC8 | 473 | KU16_PO11_11844_FAD_oxidase
TC8 | 474 | KU16_PO11_11843_P450
TC8 | 475 | KU16_PO11_11841_2_P450
TC8 | 476 | KU16_PO11_11840_FAD_oxidase
TC8 | 477 | KU16_PO11_11847_FAD_oxidase
TC8 | 478 | KU16_PO11_11846_SDR
TC8 | 479 | KU16_PO11_11848_AMP_ligase
TC8 | 480 | KU16_PO11_11839_AMP_ligase
TC9 | 481 | KU17_SC18_16687_Cyclase
TC9 | 482 | KU17_SC18_16686_P450
TC9 | 483 | KU17_SC18_16688_SDR

While preferred embodiments of the present disclosure have been shown and described herein, those skilled in the art will understand that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the disclosure.

1.-9. (canceled)
 10. A method for producing a small molecule for modulating a first target protein, the method comprising: selecting, from a database comprising a list of biosynthetic gene clusters, one or more gene clusters that include or are positioned proximal to a region that encodes a protein that is identical with or homologous to the first target protein; expressing the gene cluster or a plurality of genes from the gene cluster in a host cell; and isolating a compound produced by the gene cluster.
 11. The method of claim 10, wherein the one or more gene clusters are selected from the group consisting of (1) clusters that comprise one or more polyketide synthases, (2) clusters that comprise one or more non-ribosomal peptide synthetases, (3) clusters that comprise one or more terpene synthases, (4) clusters that comprise one or more UbiA-type terpene cyclases, and (5) clusters that comprise one or more dimethylallyl transferases.
 12. The method of claim 10, wherein the protein that is encoded by the region that is included in or positioned proximal to the biosynthetic gene cluster is identical to or has greater than 30% homology to the first target protein.
 13. The method of claim 10, wherein the region that encodes the protein that is identical with or homologous to the first target protein is within 20,000 base pairs of a region of a portion of the gene cluster that encodes a polyketide synthase, a non-ribosomal peptide synthetase, a terpene synthetase, a UbiA-type terpene cyclase, or a dimethylallyl transferase.
 14. (canceled)
 15. The method of claim 10, wherein selecting the one or more gene clusters comprises operating a computer, wherein operation of the computer comprises running an algorithm that takes into account both an input sequence for the first target protein and sequence information from a database that includes sequence information from a plurality of species such that the computer returns information corresponding to one or more gene clusters.
 16. The method of claim 15, wherein the algorithm takes into account the phylogenetic relationship between gene clusters in the database.
 17. The method of claim 10, wherein the one or more gene clusters include a coding sequence for a protein that is an extracellular protein, a membrane tethered protein, a protein involved in a transport or secretion pathway, a protein homologous to a protein involved in a transport or secretion pathway, a protein with a peptide targeting signal, a protein with a terminal sequence with homology to a targeting signal, an enzyme that degrades small molecules, or a protein with homology to an enzyme that degrades small molecules.
 18. (canceled)
 19. The method of claim 18, further comprising screening the isolated compound for modulation of an activity of the first target protein.
 20. The method of claim 18, wherein the cluster-encoded protein that is homologous to the first target protein is resistant to modulation by the isolated compound when compared to modulation of the first target protein.
 21. The method of claim 20, wherein the compound is not toxic to the species from which the cluster originates due to one or more of (1) sequence differences between the first target protein and the cluster-encoded protein, (2) spatial separation of the compound from the cluster-encoded protein and (3) high expression levels for the cluster-encoded protein.
 22.-33. (canceled)
 33. The method of claim 10, wherein the gene cluster is a gene cluster of a non-yeast fungus.
 34. (canceled)
 35. The method of claim 10, wherein the first target protein is a human protein.
 36.-37. (canceled)
 38. A system for identifying one or more biosynthetic gene clusters for introduction into a host organism to produce one or more compounds that modulate a specific target protein, the system comprising: a processor; a memory containing a gene cluster identification application; wherein the gene cluster identification application directs the processor to: load data describing at least one target protein into the memory; load data describing a plurality of biosynthetic gene clusters into the memory; score each of the plurality of biosynthetic gene clusters based upon: performing a homolog search for each biosynthetic gene cluster to determine a presence of at least one homolog of a target protein within or adjacent the biosynthetic gene cluster; confidence of homology of the at least one target protein to at least one gene in a biosynthetic gene cluster; a fraction of a homologous gene that meets an identity threshold; a total number of genes homologous to the at least one target protein present in the entire genome of an organism; homology of the at least one homolog of at least one target protein within or adjacent the biosynthetic gene cluster to genes in the target protein's genome; phylogenetic relationship of the at least one target protein to a gene in a cluster; expected number of homologs of the at least one target protein in or adjacent to a biosynthetic cluster; or a likelihood that at least one target protein is essential for cellular process in the natural environment; and output a report identifying one or more biosynthetic gene clusters that are most likely to produce a compound that modulates the at least one target protein.
 39.-40. (canceled)
 41. A method for producing a compound that binds a protein of interest, the method comprising: obtaining sequence information for a plurality of contiguous sequences, wherein each contiguous sequence includes a biosynthetic gene cluster and flanking genomic sequences; analyzing the contiguous sequences for the presence of a gene that encodes a protein with homology to the protein of interest, and selecting a biosynthetic gene cluster which includes, or is proximal to, a gene that encodes a protein that is homologous to the protein of interest; expressing the biosynthetic gene cluster or a plurality of genes from the biosynthetic gene cluster in a host cell; and isolating a compound produced by the biosynthetic gene cluster.
 42. The method of claim 41, wherein the contiguous nucleotide sequence is less than 40,000 base pairs in length.
 43.-76. (canceled)
 77. The method of claim 10, wherein the first target protein is a mammalian protein.
 78. The method of claim 18, wherein said host cell is a yeast cell.
 79. The method of claim 18, wherein each expressed gene of the plurality of genes from the gene cluster is expressed under the control of a different promoter.
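
By way of illustration only, the selecting step recited in claims 10, 12, and 13 may be sketched as a simple filter over annotated contiguous sequences. The sketch below is not the applicants' algorithm. The names `Gene`, `Contig`, `gap`, `candidate_homologs`, and `select_clusters` are hypothetical, and percent identity to the target protein is assumed to have been computed beforehand by a separate homology search. The 30% and 20,000 base pair thresholds are taken from claims 12 and 13.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gene:
    # All fields are illustrative assumptions, not a format from the disclosure.
    name: str
    start: int                        # start coordinate on the contig (bp)
    end: int                          # end coordinate on the contig (bp)
    is_core: bool = False             # True for a PKS, NRPS, terpene synthase, etc.
    identity_to_target: float = 0.0   # percent identity to the target, precomputed


@dataclass
class Contig:
    cluster_id: str
    genes: List[Gene] = field(default_factory=list)


def gap(a: Gene, b: Gene) -> int:
    """Base pairs separating two genes; 0 if they overlap or abut."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def candidate_homologs(contig: Contig,
                       min_identity: float = 30.0,
                       max_distance: int = 20_000) -> List[Gene]:
    """Non-core genes above the identity threshold that lie near a core gene."""
    cores = [g for g in contig.genes if g.is_core]
    return [
        g for g in contig.genes
        if not g.is_core
        and g.identity_to_target > min_identity              # "greater than 30%"
        and any(gap(g, c) <= max_distance for c in cores)    # "within 20,000 bp"
    ]


def select_clusters(contigs: List[Contig], **thresholds) -> List[str]:
    """Identifiers of clusters that contain at least one candidate homolog."""
    return [c.cluster_id for c in contigs if candidate_homologs(c, **thresholds)]


# Made-up example data: only BGC_A has a close, sufficiently similar homolog.
EXAMPLE_CONTIGS = [
    Contig("BGC_A", [Gene("pks", 1_000, 9_000, is_core=True),
                     Gene("homolog", 15_000, 17_000, identity_to_target=45.0)]),
    Contig("BGC_B", [Gene("nrps", 1_000, 9_000, is_core=True),
                     Gene("distant", 40_000, 42_000, identity_to_target=80.0)]),
    Contig("BGC_C", [Gene("terpene_synthase", 1_000, 3_000, is_core=True),
                     Gene("weak", 4_000, 5_000, identity_to_target=12.0)]),
]

print(select_clusters(EXAMPLE_CONTIGS))
```

In this made-up data set, BGC_B is excluded because its homolog lies 31,000 base pairs from the core gene, and BGC_C is excluded because its candidate falls below the identity threshold. A scoring scheme of the kind recited in claim 38 would replace the hard thresholds with weighted terms, which this sketch does not attempt.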